# **IV Feature Engineering**

## **IV.1 Import Libraries and Load Data**

In [38]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

awards_players_cleaned = pd.read_csv('../data/basketballPlayoffs_cleaned/awards_players_cleaned.csv')
coaches_cleaned = pd.read_csv('../data/basketballPlayoffs_cleaned/coaches_cleaned.csv')
players_cleaned = pd.read_csv('../data/basketballPlayoffs_cleaned/players_cleaned.csv')
players_teams_cleaned = pd.read_csv('../data/basketballPlayoffs_cleaned/players_teams_cleaned.csv')
series_post_cleaned = pd.read_csv('../data/basketballPlayoffs_cleaned/series_post_cleaned.csv')
teams_cleaned = pd.read_csv('../data/basketballPlayoffs_cleaned/teams_cleaned.csv')
teams_post_cleaned = pd.read_csv('../data/basketballPlayoffs_cleaned/teams_post_cleaned.csv')

## **IV.2 Players Overall Calculation**
### **IV.2.1 Stamina Overall Calculation**

In this section, we calculate the "Stamina" overall for each row of `players_teams_cleaned`. Stamina is a composite metric based on games played and minutes played, with regular-season and playoff totals added together.

Each total is divided by its mean over all rows, so a value of 1.0 means an average workload. The two ratios are combined with a weight of 20% on games played and 80% on minutes played, multiplied by 5, and clipped to the range 1 to 10, so an average workload maps to a score of 5.

In [39]:
def calculate_stamina(df):
    total_games = df['GP'].sum() + df['PostGP'].sum()
    total_minutes = df['minutes'].sum() + df['PostMinutes'].sum()

    mean_games = total_games / len(df)
    mean_minutes = total_minutes / len(df)

    overall_stamina = (
        (0.2 * (df['GP'] + df['PostGP']) / mean_games) + 
        (0.8 * (df['minutes'] + df['PostMinutes']) / mean_minutes)
    )

    overall_stamina = np.clip(overall_stamina * 5, 1, 10)
    return overall_stamina.round(1)

players_teams_cleaned['overallSTAMINA'] = calculate_stamina(players_teams_cleaned)
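
To make the scaling concrete, here is a small worked example of the same formula. The three rows and their values are invented for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three player-season rows (values invented).
toy = pd.DataFrame({
    "GP": [30, 20, 10],
    "PostGP": [0, 0, 0],
    "minutes": [900, 600, 300],
    "PostMinutes": [0, 0, 0],
})

games = toy["GP"] + toy["PostGP"]
minutes = toy["minutes"] + toy["PostMinutes"]

# Ratio to the mean over all rows: 1.0 means an average workload.
raw = 0.2 * games / games.mean() + 0.8 * minutes / minutes.mean()

# Same scaling as calculate_stamina: average maps to 5, then clip to [1, 10].
toy_stamina = np.clip(raw * 5, 1, 10).round(1)
print(toy_stamina.tolist())  # [7.5, 5.0, 2.5]
```

Because the raw ratio is multiplied by 5 before clipping, any row with at least twice the average workload saturates at 10.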


### **IV.2.2 Defense Overall Calculation**

This section calculates the "Defense" overall for each player based on their defensive performance metrics. The calculation considers a variety of factors including defensive rebounds, steals, blocks, turnovers, personal fouls (PF), and disqualifications (DQ). The formula is designed to reward players for positive defensive actions while penalizing them for turnovers, fouls, and disqualifications.

The formula weights each per-game rate, expressed relative to the dataset-wide per-game rate, as follows:
- 45% weight on defensive rebounds
- 20% weight each on steals and blocks
- 5% penalty each for turnovers, personal fouls, and disqualifications

The weights net to 0.70, so after multiplying by 5 a player whose rates all equal the dataset average scores 3.5. The result is clipped to the range 1 to 10.


In [40]:
def calculate_overall_defense(df):
    total_games = df['GP'].sum() + df['PostGP'].sum()

    mean_drebounds = (df['dRebounds'].sum() + df['PostdRebounds'].sum()) / total_games if total_games > 0 else 1
    mean_steals = (df['steals'].sum() + df['PostSteals'].sum()) / total_games if total_games > 0 else 1
    mean_blocks = (df['blocks'].sum() + df['PostBlocks'].sum()) / total_games if total_games > 0 else 1

    mean_turnovers = (df['turnovers'].sum() + df['PostTurnovers'].sum()) / total_games if total_games > 0 else 1
    mean_pf = (df['PF'].sum() + df['PostPF'].sum()) / total_games if total_games > 0 else 1
    # Fall back to 1, like the other means, so the division by mean_dq below never uses a hard-coded 0
    mean_dq = (df['dq'].sum() + df['PostDQ'].sum()) / total_games if total_games > 0 else 1

    player_drebounds = (df['dRebounds'] + df['PostdRebounds']) / (df['GP'] + df['PostGP'])
    player_steals = (df['steals'] + df['PostSteals']) / (df['GP'] + df['PostGP'])
    player_blocks = (df['blocks'] + df['PostBlocks']) / (df['GP'] + df['PostGP'])

    player_turnovers = (df['turnovers'] + df['PostTurnovers']) / (df['GP'] + df['PostGP'])
    player_pf = (df['PF'] + df['PostPF']) / (df['GP'] + df['PostGP'])
    player_dq = (df['dq'] + df['PostDQ']) / (df['GP'] + df['PostGP'])

    overall_defense = (
        (0.45 * (player_drebounds / mean_drebounds)) +
        (0.2 * (player_steals / mean_steals)) + 
        (0.2 * (player_blocks / mean_blocks)) - 
        (0.05 * (player_turnovers / mean_turnovers)) - 
        (0.05 * (player_pf / mean_pf)) - 
        (0.05 * (player_dq / mean_dq))      
    )
    
    overall_defense = np.clip(overall_defense * 5, 1, 10)
    return overall_defense.round(1)

players_teams_cleaned['overallDEFENSE'] = calculate_overall_defense(players_teams_cleaned)
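
The per-game rates above divide by `GP + PostGP`, and the final ratios divide by dataset-wide means such as `mean_dq`. Either denominator could be zero, which would produce `NaN` or `inf` scores. Whether the cleaned data contains such rows is not stated here, so the following is only a defensive sketch. The helper name `safe_ratio` is hypothetical and is not part of the notebook.

```python
import numpy as np

def safe_ratio(numerator, denominator):
    """Element-wise division that returns 0 wherever the denominator is 0."""
    num = np.asarray(numerator, dtype="float64")
    den = np.asarray(denominator, dtype="float64")
    # Start from zeros; positions with a zero denominator keep this value.
    out = np.zeros(np.broadcast(num, den).shape, dtype="float64")
    np.divide(num, den, out=out, where=den != 0)
    return out

# Hypothetical values: the second row has played no games.
per_game_dq = safe_ratio([1, 0, 2], [10, 0, 20])
print(per_game_dq.tolist())  # [0.1, 0.0, 0.1]

# A scalar denominator of 0 (for example, no disqualifications at all) also works.
print(safe_ratio([1.0, 2.0], 0).tolist())  # [0.0, 0.0]
```

The helper returns a plain NumPy array, so the original index is dropped; wrapping the result in `pd.Series(..., index=df.index)` would restore it if needed.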

### **IV.2.3 Offense Overall Calculation**

This section calculates the "Offense" overall for each player from per-game offensive metrics: points, assists, field goals made, free throws made, three-pointers made, and offensive rebounds. Each rate is divided by the dataset-wide per-game rate and weighted as follows:
- 40% weight on points
- 25% weight on assists
- 10% weight each on field goals made, free throws made, and three-pointers made
- 5% weight on offensive rebounds

The weights sum to 1.0, so after multiplying by 5 a player whose rates all equal the dataset average scores 5. The result is clipped to the range 1 to 10.

In [41]:
def calculate_overall_offense(df):
    total_games = df['GP'].sum() + df['PostGP'].sum()

    mean_points = (df['points'].sum() + df['PostPoints'].sum()) / total_games if total_games > 0 else 1
    mean_assists = (df['assists'].sum() + df['PostAssists'].sum()) / total_games if total_games > 0 else 1
    mean_fgMade = (df['fgMade'].sum() + df['PostfgMade'].sum()) / total_games if total_games > 0 else 1
    mean_ftMade = (df['ftMade'].sum() + df['PostftMade'].sum()) / total_games if total_games > 0 else 1
    mean_threeMade = (df['threeMade'].sum() + df['PostthreeMade'].sum()) / total_games if total_games > 0 else 1

    mean_orebounds = (df['oRebounds'].sum() + df['PostoRebounds'].sum()) / total_games if total_games > 0 else 1

    player_points = (df['points'] + df['PostPoints']) / (df['GP'] + df['PostGP'])
    player_assists = (df['assists'] + df['PostAssists']) / (df['GP'] + df['PostGP'])
    player_fgMade = (df['fgMade'] + df['PostfgMade']) / (df['GP'] + df['PostGP'])
    player_ftMade = (df['ftMade'] + df['PostftMade']) / (df['GP'] + df['PostGP'])
    player_threeMade = (df['threeMade'] + df['PostthreeMade']) / (df['GP'] + df['PostGP'])

    player_orebounds = (df['oRebounds'] + df['PostoRebounds']) / (df['GP'] + df['PostGP'])

    overall_offense = (
        (0.4 * (player_points / mean_points)) +
        (0.25 * (player_assists / mean_assists)) +
        (0.1 * (player_fgMade / mean_fgMade)) +
        (0.1 * (player_ftMade / mean_ftMade)) +
        (0.1 * (player_threeMade / mean_threeMade)) +
        (0.05 * (player_orebounds / mean_orebounds))
    )

    overall_offense = np.clip(overall_offense * 5, 1, 10)
    return overall_offense.round(1)

players_teams_cleaned['overallOFFENSE'] = calculate_overall_offense(players_teams_cleaned)
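
The offense formula follows the same pattern as the other ratings: a weighted sum of player-to-dataset ratios, scaled and clipped. One way to express that pattern is a table of weights plus a generic scoring function. This is a sketch only; `weighted_ratio_score`, `OFFENSE_WEIGHTS`, and the toy per-game values are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Weights copied from the offense formula; they sum to 1.0.
OFFENSE_WEIGHTS = {
    "points": 0.40, "assists": 0.25, "fgMade": 0.10,
    "ftMade": 0.10, "threeMade": 0.10, "oRebounds": 0.05,
}

def weighted_ratio_score(per_game, dataset_per_game, weights, scale=5):
    """Weighted sum of per-game rates relative to the dataset rate, scaled and clipped to [1, 10]."""
    raw = sum(w * per_game[col] / dataset_per_game[col] for col, w in weights.items())
    return np.clip(raw * scale, 1, 10).round(1)

# Hypothetical dataset-wide per-game rates (invented for illustration).
dataset_rates = {"points": 8.0, "assists": 2.0, "fgMade": 3.0,
                 "ftMade": 1.5, "threeMade": 0.5, "oRebounds": 1.0}

# Three toy players: exactly average, double the average, and no production.
toy_per_game = pd.DataFrame([
    dataset_rates,
    {k: 2 * v for k, v in dataset_rates.items()},
    {k: 0.0 for k in dataset_rates},
])

toy_offense = weighted_ratio_score(toy_per_game, dataset_rates, OFFENSE_WEIGHTS)
print(toy_offense.tolist())  # [5.0, 10.0, 1.0]
```

Keeping the weights in one dictionary makes it easy to check that they sum to the intended total before scoring.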

### **IV.2.4 Overall Combined Calculation**

This section combines the stamina, defense, and offense ratings into a single overall score. The base score is a weighted sum with weights of 0.18 for stamina and 0.37 each for defense and offense. Two bonuses are then added, each worth up to 0.4 points: the player's min-max normalized height, and the player's award count divided by the largest award count in the dataset. Players with no height or award record receive no bonus.

The combined score is clipped to the range 1 to 10. Scores above the 95th percentile are then compressed by halving their distance from that percentile, and the result is rounded to one decimal place.
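
The compression of the top scores can be shown in isolation. The sketch below applies the same rule as the code in this section, using a 75th-percentile cutoff on five invented scores so the effect is visible on a tiny sample. The function name `soften_top_tail` is hypothetical.

```python
import numpy as np

def soften_top_tail(scores, percentile=95, factor=0.5):
    """Shrink the part of each score that lies above the given percentile."""
    scores = np.asarray(scores, dtype="float64")
    cutoff = np.percentile(scores, percentile)
    # Scores at or below the cutoff are unchanged; the excess above it is scaled by `factor`.
    return np.where(scores > cutoff, cutoff + (scores - cutoff) * factor, scores)

# Hypothetical scores; the 75th percentile of this sample is 8.0.
demo = soften_top_tail([2.0, 4.0, 6.0, 8.0, 10.0], percentile=75)
print(demo.tolist())  # [2.0, 4.0, 6.0, 8.0, 9.0]
```

The ordering of players is preserved, since the transformation is monotonic; only the spread among the highest scores is reduced.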


In [42]:
def prepare_player_data(players_teams_df, players_df, awards_df):
    player_height = players_df[['bioID', 'height']].copy()
    player_height.columns = ['playerID', 'height']
    
    player_height['height_normalized'] = (player_height['height'] - player_height['height'].min()) / \
                                       (player_height['height'].max() - player_height['height'].min())
    
    awards_count = awards_df.groupby('playerID').size().reset_index(name='award_count')
    
    max_awards = awards_count['award_count'].max()
    awards_count['awards_normalized'] = awards_count['award_count'] / max_awards if max_awards > 0 else 0
    
    return player_height, awards_count

def calculate_enhanced_overall(df, player_height_df, awards_df):
    base_overall = (
        (0.18 * df['overallSTAMINA']) +  # Reduced from 0.2
        (0.37 * df['overallDEFENSE']) +  # Reduced from 0.4
        (0.37 * df['overallOFFENSE'])    # Reduced from 0.4
    )
    
    df = df.merge(player_height_df[['playerID', 'height_normalized']], 
                 on='playerID', 
                 how='left')
    
    df = df.merge(awards_df[['playerID', 'awards_normalized']], 
                 on='playerID', 
                 how='left')
    
    df['height_normalized'] = df['height_normalized'].fillna(0)
    df['awards_normalized'] = df['awards_normalized'].fillna(0)
    
    enhanced_overall = base_overall + \
                      (0.04 * df['height_normalized'] * 10) + \
                      (0.04 * df['awards_normalized'] * 10)
    
    enhanced_overall = np.clip(enhanced_overall, 1, 10)
    
    percentile_95 = np.percentile(enhanced_overall, 95)
    enhanced_overall = np.where(
        enhanced_overall > percentile_95,
        percentile_95 + (enhanced_overall - percentile_95) * 0.5,
        enhanced_overall
    )
    
    return enhanced_overall.round(1)

player_height_df, awards_count_df = prepare_player_data(
    players_teams_cleaned, 
    players_cleaned, 
    awards_players_cleaned
)

players_teams_cleaned['OVERALL'] = calculate_enhanced_overall(
    players_teams_cleaned,
    player_height_df,
    awards_count_df
)


### **IV.2.5 Export Updated DataFrame with Overalls**

After calculating the combined overall scores for each player, the updated DataFrame (`players_teams_cleaned`) is exported to a CSV file.

In [43]:
players_teams_cleaned.to_csv('../data/basketballPlayoffs_cleaned/players_teams_cleaned.csv', index=False)

### **IV.2.6 All-Time Overall**

This step averages each player's overall score across all of their rows in `players_teams_cleaned`, giving an unweighted all-time rating per player. The resulting DataFrame is saved to a new CSV file.

In [44]:
players_overall_avg = players_teams_cleaned.groupby('playerID')['OVERALL'].mean().reset_index()

players_overall_avg.rename(columns={'OVERALL': 'OVERALL_ALL_TIME'}, inplace=True)

players_overall_avg['OVERALL_ALL_TIME'] = players_overall_avg['OVERALL_ALL_TIME'].round(1)

players_overall_avg.to_csv('../data/basketballPlayoffs_cleaned/players_overall_all_time.csv', index=False)
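
For reference, the same aggregation can be written as a single chained expression using named aggregation, which avoids the separate rename step. This is an equivalent sketch on invented data, not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical toy data: player "a" has two seasons, player "b" has one.
toy = pd.DataFrame({
    "playerID": ["a", "a", "b"],
    "OVERALL": [4.0, 6.0, 7.5],
})

all_time = (
    toy.groupby("playerID", as_index=False)
       .agg(OVERALL_ALL_TIME=("OVERALL", "mean"))  # name the output column directly
       .round({"OVERALL_ALL_TIME": 1})
)
print(all_time["OVERALL_ALL_TIME"].tolist())  # [5.0, 7.5]
```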

## **IV.3 Rookies Average Overall Calculation**

In this step, we calculate the average overall score of rookie players. A player's rookie season is taken to be the earliest year recorded for that player in the dataset, so the flag marks a player's first appearance in the data, which may not be their true first professional season. The code selects the rookie rows and averages their overall scores; the result is saved to a new CSV file for future use.

In [45]:
rookie_year = players_teams_cleaned.groupby('playerID')['year'].min()

players_teams_cleaned['is_rookie'] = players_teams_cleaned.apply(
    lambda row: 1 if row['year'] == rookie_year[row['playerID']] else 0,
    axis=1
)

players_teams_cleaned.to_csv('../data/basketballPlayoffs_cleaned/players_teams_cleaned.csv', index=False)

rookie_players = players_teams_cleaned[players_teams_cleaned['is_rookie'] == 1]

rookie_overall_avg = rookie_players['OVERALL'].mean().round(1)

rookie_overall_avg_df = pd.DataFrame({'rookie_overall_avg': [rookie_overall_avg]})

rookie_overall_avg_df.to_csv('../data/basketballPlayoffs_cleaned/rookie_overall_avg.csv', index=False)
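
The row-wise `apply` above is easy to read but can be slow on large tables. A vectorized alternative uses `groupby(...).transform('min')` to broadcast each player's first year back onto every row. The sketch below uses invented data and produces the same kind of 0/1 flag.

```python
import pandas as pd

# Hypothetical toy data: two players with consecutive seasons.
toy = pd.DataFrame({
    "playerID": ["a", "a", "b", "b", "b"],
    "year": [3, 4, 1, 2, 3],
})

# transform('min') returns a Series aligned with toy, holding each player's earliest year.
first_year = toy.groupby("playerID")["year"].transform("min")
toy["is_rookie"] = (toy["year"] == first_year).astype(int)
print(toy["is_rookie"].tolist())  # [1, 0, 1, 0, 0]
```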

## **IV.4 Coaches Overall Calculation**

### **IV.4.1 Coach Overall**

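This function calculates an overall performance rating for each coach row from the win/loss record over the regular season and postseason. The coach's win percentage is divided by the win percentage over all rows, multiplied by 6, and clipped to the range 1 to 10, so a coach at the dataset-wide win rate scores 6.

The rating is added as a new column `OVERALL` to the `coaches_cleaned` dataset, which is then saved as a CSV file.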

In [46]:
def calculate_overall_coach(df):
    total_wins = df['won'].sum() + df['post_wins'].sum()
    total_losses = df['lost'].sum() + df['post_losses'].sum()
    
    total_games = total_wins + total_losses

    win_percentage = total_wins / total_games if total_games > 0 else 0

    total_wins_per_coach = df['won'] + df['post_wins']
    total_losses_per_coach = df['lost'] + df['post_losses']
    
    coach_win_percentage = total_wins_per_coach / (total_wins_per_coach + total_losses_per_coach)

    relative_performance = coach_win_percentage / win_percentage 
    
    overall = np.clip(relative_performance * 6, 1, 10) 
    return overall.round(1)

coaches_cleaned['OVERALL'] = calculate_overall_coach(coaches_cleaned)

coaches_cleaned.to_csv('../data/basketballPlayoffs_cleaned/coaches_cleaned.csv', index=False)
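
A short worked example shows how the relative win percentage turns into a rating. The two coach rows below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one winning and one losing coach season.
toy = pd.DataFrame({
    "won": [20, 10], "lost": [10, 20],
    "post_wins": [0, 0], "post_losses": [0, 0],
})

wins = toy["won"] + toy["post_wins"]
losses = toy["lost"] + toy["post_losses"]

# Win percentage over all rows: 30 wins in 60 games = 0.5.
overall_pct = wins.sum() / (wins.sum() + losses.sum())

# Per-row win percentage relative to the overall rate, scaled by 6 and clipped.
coach_pct = wins / (wins + losses)
toy_coach_overall = np.clip(coach_pct / overall_pct * 6, 1, 10).round(1)
print(toy_coach_overall.tolist())  # [8.0, 4.0]
```

With a scale factor of 6, a row reaches the cap of 10 once its win percentage is about 1.67 times the overall rate. A row with zero games would divide by zero and produce `NaN`, which the function above does not guard against.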

### **IV.4.2 All-Time Coach Overall**

In this step, the average overall score for each coach across all seasons is calculated. This represents each coach's performance over time. The DataFrame is then saved to a new CSV file.

In [47]:
coaches_cleaned = pd.read_csv('../data/basketballPlayoffs_cleaned/coaches_cleaned.csv')

coaches_overall_avg = coaches_cleaned.groupby('coachID')['OVERALL'].mean().reset_index()

coaches_overall_avg.rename(columns={'OVERALL': 'OVERALL_ALL_TIME'}, inplace=True)

coaches_overall_avg['OVERALL_ALL_TIME'] = coaches_overall_avg['OVERALL_ALL_TIME'].round(1)

coaches_overall_avg.to_csv('../data/basketballPlayoffs_cleaned/coaches_overall_all_time.csv', index=False)

### **IV.4.3 Coach Rookies Average Overall**

In this step, we calculate the average overall score of rookie coaches. A coach's rookie season is taken to be the earliest year recorded for that coach in the dataset. The code selects the rookie rows and averages their overall scores; the result is saved to a new CSV file for future use.

In [48]:
rookie_year = coaches_cleaned.groupby('coachID')['year'].min()

coaches_cleaned['is_rookie'] = coaches_cleaned.apply(
    lambda row: 1 if row['year'] == rookie_year[row['coachID']] else 0,
    axis=1
)

rookie_coaches = coaches_cleaned[coaches_cleaned['is_rookie'] == 1]

rookie_overall_avg = rookie_coaches['OVERALL'].mean().round(1)

rookie_overall_avg_df = pd.DataFrame({'rookie_overall_avg': [rookie_overall_avg]})

rookie_overall_avg_df.to_csv('../data/basketballPlayoffs_cleaned/rookie_overall_coaches_avg.csv', index=False)

coaches_cleaned.to_csv('../data/basketballPlayoffs_cleaned/coaches_cleaned.csv', index=False)
