# Deep Inspection
In this phase, we analyze the two tables in more detail to eliminate inconsistent data.

Having completed the data cleaning and integration in the previous notebook, we now prepare the data for the model. Following the findings reported there, we move the data from its raw form to a cleaner one and discard the attributes we did not consider relevant.

This notebook first performs a deep inspection and then the actual transformation of the data.


Our main data preparation actions were:
- Players who did not participate in any games during the years covered by the dataset were removed.
- Several columns contained null values that were not initially recognized as such. Dates formatted as 00-00-00, integers with a value of 0, and empty strings were all converted to None.
- Attributes that followed an approximately normal distribution were standardized; the others were scaled linearly to the range 0 to 1.
- We summed the statistics recorded by the individual players of each team and used these sums to replace the corresponding team-level values.
- We created new team attributes: win rate, Total Player Insights (TPI), and the number of postseason matches.
- We created attributes for the average weight and height of each team based on the players' stats, and we calculated the players' ages.
- Some attributes were merged into new ones or converted to numerical format.
- We divided the table into East and West to study each conference separately, and then merged everything back together.
- Finally, we added a new column, `PlayOffNextYear`, which records for each team whether it qualified for the playoffs of the following year.
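
The null-handling step in the list above could look roughly like the following. This is a minimal sketch on toy data, not the code used in the earlier notebook: the column names `birthDate`, `height`, and `college`, the exact placeholder string `0000-00-00`, and the helper `mark_missing` are all assumptions made for illustration.

```python
import pandas as pd

# Toy player table; the column names and values are hypothetical.
players = pd.DataFrame({
    "birthDate": ["1980-05-01", "0000-00-00", "1975-11-23"],
    "height": [72, 0, 75],
    "college": ["Duke", "", "UConn"],
})

def mark_missing(df, date_placeholder="0000-00-00"):
    """Return a copy in which placeholder values are replaced by NaN."""
    out = df.copy()
    # Series.mask replaces the entries where the condition is True with NaN
    out["birthDate"] = out["birthDate"].mask(out["birthDate"] == date_placeholder)
    out["height"] = out["height"].mask(out["height"] == 0)
    out["college"] = out["college"].mask(out["college"] == "")
    return out

cleaned = mark_missing(players)
print(cleaned.isna().sum())
```

Marking placeholders as missing before any aggregation keeps a height of 0, for instance, from pulling a team's average height down.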

In [228]:
import numpy as np
import pandas as pd
import transformation_utils as util
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.preprocessing import StandardScaler


This script processes WNBA player statistics to update team-level data for the Eastern and Western conferences. 

It first loads the player data and defines mappings that aggregate individual stats (e.g., field goals made or rebounds) into team-level metrics. 

For each conference, it updates the team files by aggregating the relevant player statistics and saving the results. Once the team-level data is updated, redundant player-specific columns are removed from the team files, so that the final datasets contain only the necessary aggregated team statistics.

A notable finding was an inconsistency between a team's recorded totals for a year (for example `o_fgm`) and the sum of its individual players' statistics for that year. We decided to use the sum of the player stats for each team and year.
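
The implementation of `util.update_team_data` is not shown in this notebook. The sketch below illustrates, on toy data, one way such an aggregation and mismatch check could work. The helper name `sum_player_stat`, the temporary `player_sum` column, and the toy values are assumptions for illustration, not the project's actual code.

```python
import pandas as pd

# Toy data; the real files contain many more columns.
players = pd.DataFrame({
    "tmID": ["ATL", "ATL", "WAS", "WAS"],
    "year": [9, 9, 9, 9],
    "fgMade": [400, 418, 500, 370],
})
teams = pd.DataFrame({
    "tmID": ["ATL", "WAS"],
    "year": [9, 9],
    "o_fgm": [895, 870],
})

def sum_player_stat(teams, players, player_stat, team_stat):
    """Replace team_stat with the per-team, per-year sum of player_stat.

    Returns the updated team table and the rows where the two values differed.
    """
    sums = (players.groupby(["tmID", "year"], as_index=False)[player_stat]
            .sum()
            .rename(columns={player_stat: "player_sum"}))
    merged = teams.merge(sums, on=["tmID", "year"], how="left")
    mismatches = merged[merged[team_stat] != merged["player_sum"]]
    # Keep the original team value when a team-year has no player rows
    merged[team_stat] = merged["player_sum"].fillna(merged[team_stat])
    return merged.drop(columns="player_sum"), mismatches

updated, mismatches = sum_player_stat(teams, players, "fgMade", "o_fgm")
print(mismatches)
```

In this toy case only `ATL` is reported as a mismatch, and its `o_fgm` is replaced by the player sum.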


In [229]:
df_players = pd.read_csv('../newData/final_players_teams.csv')

stat_mappings = [
    ('fgMade', 'o_fgm'), ('ftMade', 'o_ftm'), ('threeMade', 'o_3pm'),
    ('fgAttempted', 'o_fga'), ('ftAttempted', 'o_fta'), ('threeAttempted', 'o_3pa'),
    ('oRebounds', 'o_oreb'), ('dRebounds', 'o_dreb'), ('rebounds', 'o_reb'),
    ('assists', 'o_asts'), ('steals', 'o_stl'), ('turnovers', 'o_to'), ('blocks', 'o_blk')
]

for side in ['EA', 'WE']:
    for player_stat, team_stat in stat_mappings:
        util.update_team_data(f'../newData/teams_{side}_cleaned.csv', df_players, player_stat, team_stat,
                         f'../newData/teams_{side}_cleaned.csv')

columns_to_remove = [
    'fgMade', 'ftMade', 'threeMade', 'fgAttempted', 'ftAttempted', 'threeAttempted',
    'oRebounds', 'dRebounds', 'rebounds', 'assists', 'steals', 'turnovers', 'blocks'
]

for path in ['../newData/teams_EA_cleaned.csv', '../newData/teams_WE_cleaned.csv']:
    df_teams_final = pd.read_csv(path)
    df_teams_final = df_teams_final.drop(columns=columns_to_remove, errors='ignore')
    df_teams_final.to_csv(path, index=False)

Mismatches found for o_fgm:
     tmID  year  o_fgm  sum_fgMadePlayer  diff_fgMade
0     ATL     9  895.0        818.000000    77.000000
1     ATL     9  895.0        818.000000    77.000000
2     ATL     9  895.0        818.000000    77.000000
3     ATL     9  895.0        818.000000    77.000000
4     ATL     9  895.0        818.000000    77.000000
...   ...   ...    ...               ...          ...
1136  WAS    11  870.4        997.181481  -126.781481
1137  WAS    11  870.4        997.181481  -126.781481
1138  WAS    11  870.4        997.181481  -126.781481
1139  WAS    11  870.4        997.181481  -126.781481
1140  WAS    11  870.4        997.181481  -126.781481

[167 rows x 5 columns]
Mismatches found for o_ftm:
     tmID  year  o_ftm  sum_ftMadePlayer  diff_ftMade
0     ATL     9  542.0        476.000000    66.000000
1     ATL     9  542.0        476.000000    66.000000
2     ATL     9  542.0        476.000000    66.000000
3     ATL     9  542.0        476.000000    66.000000
4 

Another relevant element is that two teams (one from the East and one from the West) changed their names starting from year 4. 

We decided to update the `tmID` and the team name to the most recent ones for years 1, 2, and 3.

In [230]:
teams_EA = pd.read_csv('../newData/teams_EA_cleaned.csv')
teams_EA.loc[teams_EA['tmID'] == 'ORL', 'tmID'] = 'CON'
teams_EA.loc[teams_EA['tmID'] == 'CON', 'name'] = 'Connecticut Sun'
teams_EA.to_csv('../newData/teams_EA_cleaned.csv', index=False)

teams_WE = pd.read_csv('../newData/teams_WE_cleaned.csv')
teams_WE.loc[teams_WE['tmID'] == 'UTA', 'tmID'] = 'SAS'
teams_WE.loc[teams_WE['tmID'] == 'SAS', 'name'] = 'San Antonio Silver Stars'
teams_WE.to_csv('../newData/teams_WE_cleaned.csv', index=False)


Subsequently, we merged the two tables.



In [231]:
df_ea = pd.read_csv('../newData/teams_EA_cleaned.csv')
df_we = pd.read_csv('../newData/teams_WE_cleaned.csv')

combined_df = pd.concat([df_ea, df_we], ignore_index=True)
combined_df.to_csv('../newData/combined_teams.csv', index=False)

Part of this phase was also to derive new attributes from data already present in the datasets, adapting it to the model. 

We created a variable called Total Player Insights (TPI), a weighted sum of a team's offensive and defensive statistics that uses the weights defined in the cell below. It is stored per team in the `TPI_Sum` column, and the columns used to calculate it are then removed.

We also introduced the win rate (`winrate`) of each team, expressed as the percentage of games won.
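
As a worked example of these two derived attributes, the sketch below computes a weighted sum and a win rate on toy data. It uses only three of the weights and a vectorized formulation, and the team IDs, statistics, and the `won` and `GP` column names are made up for illustration.

```python
import pandas as pd

# Subset of the weights, for illustration only
weights = {"o_pts": 2, "o_to": -2, "d_pts": -1}

team = pd.DataFrame({
    "tmID": ["AAA", "BBB"],
    "o_pts": [100, 90],
    "o_to": [10, 15],
    "d_pts": [95, 80],
    "won": [20, 17],
    "GP": [34, 34],
})

# Multiply each column by its weight (aligned on column names), then sum per row
cols = list(weights)
team["TPI_Sum"] = team[cols].mul(pd.Series(weights), axis=1).sum(axis=1)

# Win rate as a percentage of games played
team["winrate"] = team["won"] / team["GP"] * 100

print(team[["tmID", "TPI_Sum", "winrate"]])
```

For team `AAA` this gives 2·100 − 2·10 − 1·95 = 85, and for `BBB` 2·90 − 2·15 − 1·80 = 70.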

In [232]:
import pandas as pd

TPI_weights = {
    'o_pts': 2,       
    'o_fgm': 1,       
    'o_3pm': 2,       
    'o_ftm': 1,       
    'o_reb': 1,       
    'o_asts': 1,     
    'o_to': -2,       
    'o_pf': -1,       
    'd_reb': -1,       
    'd_stl': -1,       
    'd_blk': 1,      
    'd_pts': -1,      
    'd_pf': -1,       
    'd_to': 2       
}

df = pd.read_csv('../newData/combined_teams.csv')

df['TPI_Sum'] = 0.0
for index, row in df.iterrows():
    TPI_sum = sum(row[stat] * weight for stat, weight in TPI_weights.items() if stat in row)
    df.at[index, 'TPI_Sum'] = TPI_sum
    
columns_to_remove = list(TPI_weights.keys())
df.drop(columns=columns_to_remove, inplace=True)

df['winrate'] = df['won_x'] / df['GP_x'] * 100

# Transformation

### Eliminating Useless Attributes
In this section, we decided which attributes were not relevant and could be deleted.

In [233]:
df.drop(
    [
        "franchID",
        "won_x",
        "lost_x",
        "homeW",
        "homeL",
        "awayW",
        "awayL",
        "name",
        "confW",
        "confL",
        "min",
        "attend",
        "arena",
        "GP_y",
        "GP_x",
        "stint_x",
        "points",
        "PF",
        "GS",
        "minutes",
        "dq",
        "PostGP",
        "PostGS",
        "stint_y",
        "won_y",
        "lost_y",
    ],
    axis=1, inplace=True,
)

## Preparing Data


In the following section, the code processes WNBA team data to handle playoff-related information across multiple years. 

It first standardizes the key columns (`playoff`, `firstRound`, `semis`, `finals`) using a mapping that converts the values `Y`, `L`, and `W` to `1` and `N` to `0`.
 
Next, a new DataFrame is created with one row per team per year, containing unique values for the columns of interest. For each team, we calculated the average playoff-related stats for years before 11 and we filled any missing values for year 11 with these averages. 

Additionally, we updated the `playoff` column for year 11 by copying the value from year 10 or assigning `0` if unavailable. 

Finally, a new column, `roundsPlayed`, is created as the sum of `firstRound`, `semis`, and `finals`, and these intermediate columns are dropped to simplify the dataset. 

This makes the data as clean and consistent as possible for further analysis.
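
The mapping and the `roundsPlayed` sum can be illustrated on a small toy table. The team IDs and values below are made up, and the sketch uses `Series.map` instead of `replace`, so values without a mapping (including missing ones) become NaN.

```python
import pandas as pd

rounds = ["firstRound", "semis", "finals"]
toy = pd.DataFrame({
    "tmID": ["AAA", "BBB", "CCC"],
    "playoff": ["Y", "Y", "N"],
    "firstRound": ["W", "L", None],
    "semis": ["W", None, None],
    "finals": ["L", None, None],
})

mapping = {"Y": 1, "L": 1, "W": 1, "N": 0}
cols = ["playoff"] + rounds
toy[cols] = toy[cols].apply(lambda s: s.map(mapping))

# NaN entries are skipped by sum, so rounds a team never reached count as 0
toy["roundsPlayed"] = toy[rounds].sum(axis=1)
print(toy[["tmID", "playoff", "roundsPlayed"]])
```

Because both `W` and `L` map to `1`, `roundsPlayed` counts the rounds a team took part in, not the rounds it won.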

In [234]:
columns_of_interest = ['playoff', 'firstRound', 'semis', 'finals']

mapping = {'Y': 1, 'L': 1, 'W': 1, 'N': 0}
df[columns_of_interest] = df[columns_of_interest].replace(mapping)

unique_per_year = df.groupby(['year', 'tmID'])[columns_of_interest].first().reset_index()

teams = df['tmID'].unique()
for team in teams:
    team_data_past = unique_per_year[(unique_per_year['tmID'] == team) & (unique_per_year['year'] < 11)]
    team_means = team_data_past[columns_of_interest].mean()
    df.loc[(df['tmID'] == team) & (df['year'] == 11), columns_of_interest] = \
        df.loc[(df['tmID'] == team) & (df['year'] == 11), columns_of_interest].fillna(team_means)

for team in df['tmID'].unique():
    playoff_year_10 = df.loc[(df['tmID'] == team) & (df['year'] == 10), 'playoff']

    if not playoff_year_10.empty:
        value_to_copy = playoff_year_10.values[0]
    else:
        value_to_copy = 0
    
    df.loc[(df['tmID'] == team) & (df['year'] == 11), 'playoff'] = value_to_copy


df["roundsPlayed"] = df[["semis", "finals", "firstRound"]].sum(axis=1)

df.drop(["semis", "finals", "firstRound"], axis=1, inplace=True)



Afterward, we modified the dataset to reassign a team and reset its statistics. Specifically, we updated the team `DET` in year 11, changing its identifier to `TUL`, assigning it to the Western Conference (`WE`), and setting its `playoff` participation and `roundsPlayed` to `0`. 

Additionally, for the newly created team `TUL`, all specified performance and statistical columns (e.g., offensive and defensive stats, postseason results, awards, win rate, and TPI score) are set to `0`, ensuring the team starts with a clean slate in the dataset.

In [235]:
df.loc[(df['tmID'] == 'DET') & (df['year'] == 11), ['tmID', 'confID','playoff', 'roundsPlayed']] = ['TUL', 'WE', 0, 0]

columns_to_zero = [
    'o_fga', 'o_fta', 'o_3pa', 'o_oreb', 'o_dreb', 'o_stl', 'o_blk',
    'd_fgm', 'd_fga', 'd_ftm', 'd_fta', 'd_3pm', 'd_3pa', 'd_oreb',
    'd_dreb', 'd_asts', 'post_wins', 'post_losses', 'award_y', 'winrate', 'TPI_Sum'
]
df.loc[df['tmID'] == 'TUL', columns_to_zero] = 0

This section creates a summarized team-level dataset by iterating over each team and year combination in the original dataset. 

For each team-year, it extracts relevant data, fills missing values with `0`, and computes additional statistics, such as the count of player awards (`award_player`), whether the coach won an award (`award_coach`), the average height and weight of players, and the team's average player age based on birthdates. 

It then discards unnecessary columns and award-related columns before appending the processed data to a new DataFrame, `new_df`. 

Finally, the script sorts the data by year and team, resulting in a clean and aggregated dataset optimized for team-level analysis.
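
The same kind of team-year summary can also be expressed with `groupby` and named aggregation. The sketch below is an alternative formulation on toy data, not the loop used in this notebook, and it covers only the averages and the player award count.

```python
import pandas as pd

# Toy player-level rows; values are made up for illustration
toy = pd.DataFrame({
    "tmID": ["AAA", "AAA", "BBB"],
    "year": [1, 1, 1],
    "height": [70.0, 74.0, 72.0],
    "weight": [150.0, 170.0, 160.0],
    "award_x": ["MVP", None, None],
})

summary = (
    toy.groupby(["year", "tmID"], as_index=False)
       .agg(
           height=("height", "mean"),
           weight=("weight", "mean"),
           # count ignores missing values, so this is the number of awards
           award_player=("award_x", "count"),
       )
)
print(summary)
```

The result has one row per team and year, already sorted by the grouping keys.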

In [236]:
new_df = pd.DataFrame()
for year in df["year"].unique():
    for team in df["tmID"].unique():
        small_df = df[(df["tmID"] == team) & (df["year"] == year)]
        if small_df.empty:
            continue

        # Start from the first row of the team-year and fill missing values with 0
        d = pd.DataFrame([small_df.iloc[0]])
        d.fillna(0, inplace=True)
        # Number of non-null player awards, and whether the coach won an award
        d["award_player"] = small_df["award_x"].count()
        d["award_coach"] = d["award_y"].apply(lambda i: 1 if i != 0 else 0)
        d["height"] = small_df["height"].mean()
        d["weight"] = small_df["weight"].mean()
        # Calendar year of the season (2000 + year) minus the value from util.get_overall_age
        d["playersAge"] = (2000 + year) - util.get_overall_age(
            small_df["birthDate"]
        )
        # Drop the player-level and award columns that are no longer needed
        d.drop(
            ["playerID", "birthDate", "award_x", "award_y", "coachID"],
            axis=1, inplace=True,
        )

        new_df = pd.concat([new_df, d])

df = new_df.sort_values(by=["year", "tmID"])

# Feature Encoding

This section uses `LabelEncoder` to convert the categorical `confID` column into numerical values, and then separates the key columns from the numerical columns that will be scaled.
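
As a small illustration of what `LabelEncoder` does to a conference column, the sketch below encodes a toy series. The values `EA` and `WE` follow the file names used in this notebook; the series itself is made up.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

conf = pd.Series(["WE", "EA", "EA", "WE"])

le = LabelEncoder()
encoded = le.fit_transform(conf)

# Classes are sorted, so each label is replaced by its position in le.classes_
print(list(le.classes_))
print(encoded.tolist())

# The original labels can be recovered from the codes
decoded = le.inverse_transform(encoded)
```

With only two conferences the encoding is effectively a binary flag, so the arbitrary ordering of the integer codes carries no misleading ranking.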

In [237]:
le = LabelEncoder()
df["confID"] = le.fit_transform(df["confID"])

key_cols = ["confID", "year", "playoff", "tmID"]

numerical_cols = [col for col in df.columns if col not in key_cols]

## Scaling of Numerical Variables
This section of code scales numerical variables in the dataset using a custom approach based on their distribution. 

It identifies Gaussian-like columns (absolute skewness below 0.5) and scales them using `StandardScaler`, which standardizes the values to have a mean of 0 and a standard deviation of 1. 

All other numerical columns are scaled using `MinMaxScaler`, which transforms values to a range between 0 and 1. The scaled dataset is then saved to a CSV file, ensuring numerical features are appropriately normalized for further analysis or modeling.
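
To see what the skewness rule does in practice, the sketch below applies it to two toy columns, one symmetric and one strongly right-skewed. The column names and values are made up; the 0.5 threshold is the one described above.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

toy = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],  # skewness 0
    "skewed": [1.0, 1.0, 1.0, 2.0, 10.0],    # strongly right-skewed
})

chosen = {}
for col in toy.columns:
    # Near-symmetric columns are standardized, the others are min-max scaled
    if abs(toy[col].skew()) < 0.5:
        scaler = StandardScaler()
    else:
        scaler = MinMaxScaler()
    chosen[col] = type(scaler).__name__
    toy[[col]] = scaler.fit_transform(toy[[col]])

print(chosen)
print(toy)
```

After scaling, `symmetric` has mean 0 and unit (population) standard deviation, while `skewed` lies between 0 and 1.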


In [238]:
def custom_scaling(df, numerical_cols):
    """Standardize near-symmetric columns and min-max scale the rest (modifies df in place)."""
    gaussian_cols = []
    other_cols = []

    for col in numerical_cols:
        if abs(df[col].skew()) < 0.5:  # Assuming skewness < 0.5 indicates Gaussian
            gaussian_cols.append(col)
        else:
            other_cols.append(col)

    if gaussian_cols:
        df[gaussian_cols] = StandardScaler().fit_transform(df[gaussian_cols])
    if other_cols:
        df[other_cols] = MinMaxScaler().fit_transform(df[other_cols])

    return df

In [239]:
df = custom_scaling(df, numerical_cols)
df.to_csv('../newData/transformed_data.csv', index=False)

### Adding PlayOffNextYear
In the following section, we added a new column, `PlayOffNextYear`, which records for each team whether it qualified for the playoffs of the following year. 

The code first sorts the dataset by `tmID` (team ID) and `year`, and keeps only the years up to and including 10. 

The `PlayOffNextYear` column is created by shifting the `playoff` column up by one row, so that each row receives the `playoff` value of the team's next recorded season. Where the next row belongs to a different team, `PlayOffNextYear` is set to `None`, and rows with missing values are removed. 

The rows for years 10 and 11 are then read again from the transformed data, given a `PlayOffNextYear` of `NaN`, and appended to the dataset. Finally, as a special case, any `playoff` value of `1` for team `TUL` is set to `0`.
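
The shifting step can also be written with a grouped shift, which keeps values from leaking across teams without comparing `tmID` on consecutive rows. The sketch below uses toy data and is an alternative formulation, not the cell that follows; like the original, it assumes each team's rows cover consecutive years.

```python
import pandas as pd

toy = pd.DataFrame({
    "tmID": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "year": [1, 2, 3, 1, 2],
    "playoff": [1, 0, 1, 0, 1],
}).sort_values(["tmID", "year"])

# Within each team, take the playoff value of the next row (the next season)
toy["PlayOffNextYear"] = toy.groupby("tmID")["playoff"].shift(-1)

# The last season of each team has no following season, so its label is NaN
labelled = toy.dropna(subset=["PlayOffNextYear"])
print(labelled)
```

Rows without a label, such as a team's final season, can then be kept aside as the data to predict on.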

In [240]:
df = pd.read_csv('../newData/transformed_data.csv')
df = df.sort_values(by=["tmID", "year"])
df = df[df['year'] <= 10]
df['PlayOffNextYear'] = df['playoff'].shift(-1)
df.loc[df['tmID'] != df['tmID'].shift(-1), 'PlayOffNextYear'] = None
df.dropna(subset=['PlayOffNextYear'] , inplace=True)


transformed_data = pd.read_csv('../newData/transformed_data.csv')
y_filtered = transformed_data[transformed_data['year'] == 10].copy()
y_filtered['PlayOffNextYear'] = np.nan
z_filtered = transformed_data[transformed_data['year'] == 11].copy()
z_filtered['PlayOffNextYear'] = np.nan

# Append the filtered year-10 and year-11 rows back to the dataset
df = pd.concat([df, y_filtered, z_filtered], ignore_index=True)
df = df.sort_values(by=["tmID", "year"])
df.loc[(df['tmID'] == 'TUL') & (df['playoff'] == 1), 'playoff'] = 0

df.to_csv('../newData/Shifted_playoff.csv', index=False)