# Notebook - WNBA Playoffs Qualification Prediction
The objective of this project is to develop a machine learning model that predicts which WNBA teams will qualify for the playoffs in the next season, based on data from previous seasons.

Authors:
- 
- Marcelo Apolinário
- Pedro Gomes

## Step 1: Data Analysis
We started by importing all data from the .csv files into DataFrames:

In [None]:
import pandas as pd

awards_players = pd.read_csv('dataset/awards_players.csv')
coaches = pd.read_csv('dataset/coaches.csv')
players = pd.read_csv('dataset/players.csv')
players_teams = pd.read_csv('dataset/players_teams.csv')
series_post = pd.read_csv('dataset/series_post.csv')
teams = pd.read_csv('dataset/teams.csv')
teams_post  = pd.read_csv('dataset/teams_post.csv')

After importing the .csv files, we looked at each table to see what data was available to work with. We also took notes on each table's attributes to make the data easier to analyse and understand. Below is an explanation of each attribute of each table, together with a sample of rows taken from it.
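A quick overview of all tables at once can complement the per-table inspection. The sketch below is only an illustration on toy DataFrames: `summarize_tables` is a hypothetical helper, not part of the original notebook, and the toy values are made up.

```python
import pandas as pd

def summarize_tables(tables):
    """Return one row per table: row count, column count and missing cells."""
    rows = []
    for name, df in tables.items():
        rows.append({
            "table": name,
            "rows": len(df),
            "columns": df.shape[1],
            "missing_cells": int(df.isna().sum().sum()),
        })
    return pd.DataFrame(rows).set_index("table")

# Toy stand-ins for the real DataFrames (hypothetical values)
toy_tables = {
    "coaches": pd.DataFrame({"coachID": ["a", "b"], "won": [20, 12]}),
    "players": pd.DataFrame({"bioID": ["x", "y", "z"], "pos": ["G", None, "F"]}),
}

overview = summarize_tables(toy_tables)
print(overview)
```

With the real data, the same helper could be called with a dictionary such as `{"coaches": coaches, "players": players}`.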

## Table: Awards-Players

This table represents an association between a player and an award she received.

### Attribute Specification

| Attribute Name | Description |
| -- | -- |
|playerID|Player identifier|
|award|Name of the award|
|year|Year the player was awarded with this award|
|lgID|League identifier for the league where the player was awarded|

### Table Sample

In [None]:
awards_players.head(15)

### Table Analysis

In [None]:
print(awards_players.describe())

## Table: Coaches

Represents an association between a coach, a league season, a team and the stats gathered for that combination.

### Attribute Specification
| Attribute Name | Description |
|--|--|
|coachID| Indicates which coach the stats refer to|
|year| Indicates the year the stats refer to|
|tmID| Indicates the team the stats refer to|
|lgID| Indicates the league the stats refer to|
|stint| Period of time that a player, coach, or other individual spends with a particular team or in the league itself|
|won| Number of matches won by the team in the specified year|
|lost| Number of matches lost by the team in the specified year|
|post_wins| Number of wins during playoffs|
|post_losses| Number of losses during playoffs|

### Table Sample


In [None]:
coaches.head(15)

### Table Analysis

In [None]:
print(coaches.describe())

## Table: Players

Associates a series of stats with a player.

|Attribute|Description|
|--|--|
|bioID| Indicates the player the stats refer to.|
|pos| Position(s) the player plays in|
|firstseason| First season of the player (all values are 0 in this dataset)|
|lastseason| Last season of the player (all values are 0 in this dataset)|
|height| Player's height|
|weight| Player's weight|
|college| College the player attended|
|collegeOther| Another college the player attended|
|birthDate| Player's date of birth|
|deathDate| Player's date of death ("0000-00-00" in case the player is still alive)|

### Table Sample

In [None]:
players.head(15)

### Table Analysis

In [None]:
print(players.describe())

## Table: Players-Teams
Associates a player with a year, the team they played in that year and a set of stats on their performance.

|Attribute|Description|
|--|--|
|playerID| Identifies a player|
|year| Year the player played in the team|
|stint| Player's spell with the team within the season (see `stint` in the Coaches table)|
|tmID| Identifies the team the player played in|
|lgID| Identifies the league the stats refer to|
|GP| Games Played|
|GS| Games Started|
|minutes| Minutes Played|
|points| Points Scored|
|oRebounds| Offensive Rebounds|
|dRebounds| Defensive Rebounds|
|rebounds| Total Rebounds|
|assists| Assists|
|steals| Steals|
|blocks| Blocks|
|turnovers| Turnovers|
|PF| Personal Fouls|
|fgAttempted| Field Goals Attempted|
|fgMade| Field Goals Made|
|ftAttempted| Free Throws Attempted|
|ftMade| Free Throws Made|
|threeAttempted| Three Point Field Goals Attempted|
|threeMade| Three Point Field Goals Made|
|dq| Disqualifications|
|PostGP| Games Played in the Playoffs|
|PostGS| Games Started in the Playoffs|
|PostMinutes| Minutes Played in the Playoffs|
|PostPoints| Points Scored in the Playoffs|
|PostoRebounds| Offensive Rebounds in the Playoffs|
|PostdRebounds| Defensive Rebounds in the Playoffs|
|PostRebounds| Total Rebounds in the Playoffs|
|PostAssists| Assists in the Playoffs|
|PostSteals| Steals in the Playoffs|
|PostBlocks| Blocks in the Playoffs|
|PostTurnovers| Turnovers in the Playoffs|
|PostPF| Personal Fouls in the Playoffs|
|PostfgAttempted|Field Goals Attempted in the Playoffs|
|PostfgMade|Field Goals Made in the Playoffs|
|PostftAttempted| Free Throws Attempted in the Playoffs|
|PostftMade| Free Throws Made in the Playoffs|
|PostthreeAttempted|Three Point Field Goals Attempted in the Playoffs|
|PostthreeMade|Three Point Field Goals Made in the Playoffs|
|PostDQ| Disqualifications in the Playoffs|

### Table Sample

In [None]:
players_teams.head(15)

### Table Analysis

In [None]:
print(players_teams.describe())

## Table: Series-Post

Represents the results of playoff series over the years. It is important to note that the playoffs work as follows:
- The 8 best teams qualify for the playoffs
- Four pairs of teams are formed to play the quarter-finals (round = FR) (series A to D)
- Two pairs of teams advance to the semi-finals (round = CF) (series E and F)
- One pair of teams advances to the finals (round = F) (series G)
- Each series is played in a best-of-three format
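The structure described above can be checked against the data by counting series per round and participating teams per season. The sketch below runs on a toy stand-in for `series_post` with hypothetical team codes; on the real table, any season that deviates from the 4/2/1 pattern would stand out.

```python
import pandas as pd

# Toy stand-in for series_post with one complete season (hypothetical team codes)
toy_series = pd.DataFrame({
    "year":   [1, 1, 1, 1, 1, 1, 1],
    "round":  ["FR", "FR", "FR", "FR", "CF", "CF", "F"],
    "series": ["A", "B", "C", "D", "E", "F", "G"],
    "tmIDWinner": ["T1", "T2", "T3", "T4", "T1", "T3", "T1"],
    "tmIDLoser":  ["T8", "T7", "T6", "T5", "T2", "T4", "T3"],
})

# Number of series per season and round
series_per_round = toy_series.groupby(["year", "round"]).size().unstack(fill_value=0)

# Every team that appears in at least one series, per season
participants = pd.concat([
    toy_series[["year", "tmIDWinner"]].rename(columns={"tmIDWinner": "tmID"}),
    toy_series[["year", "tmIDLoser"]].rename(columns={"tmIDLoser": "tmID"}),
])
teams_per_year = participants.groupby("year")["tmID"].nunique()

print(series_per_round)
print(teams_per_year)
```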

|Attribute|Description|
|--|--|
|year|Year the data refers to|
|round|Round in which the series was played|
|series|Identifies the series within the playoff structure (A to G)|
|tmIDWinner| Identifies the team that won the series|
|lgIDWinner| Identifies the league of the winning team|
|tmIDLoser| Identifies the team that lost the series|
|lgIDLoser| Identifies the league of the losing team|
|W|Number of games won by the series winner|
|L|Number of games won by the series loser|

### Table Sample

In [None]:
series_post.head(15)

### Table Analysis

In [None]:
print(series_post.describe())

## Table: Teams

Associates a team with their data over a year.

|Attribute|Description|
|--|--|
|year|States the year the data refers to|
|lgID|Identifies the league|
|tmID|Identifies the team|
|franchID|Identifies the franchise|
|confID|Conference (East or West)|
|divID|Identifies the division|
|rank|Position|
|playoff|Whether the team qualified for the playoffs or not|
|seeded|???|
|firstRound|Results in the quarter finals|
|semis|Results in the semi finals|
|finals|Results in the finals|
|name|Team Name|
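|o_fgm|Field goals made by the team|
|o_fga|Field goals attempted by the team|
|o_ftm|Free throws made by the team|
|o_fta|Free throws attempted by the team|
|o_3pm|Three-point field goals made by the team|
|o_3pa|Three-point field goals attempted by the team|
|o_oreb|Offensive rebounds by the team|
|o_dreb|Defensive rebounds by the team|
|o_reb|Total rebounds by the team|
|o_asts|Assists by the team|
|o_pf|Personal fouls by the team|
|o_stl|Steals by the team|
|o_to|Turnovers by the team|
|o_blk|Blocks by the team|
|o_pts|Points scored by the team|
|d_fgm|Field goals made by opponents|
|d_fga|Field goals attempted by opponents|
|d_ftm|Free throws made by opponents|
|d_fta|Free throws attempted by opponents|
|d_3pm|Three-point field goals made by opponents|
|d_3pa|Three-point field goals attempted by opponents|
|d_oreb|Offensive rebounds by opponents|
|d_dreb|Defensive rebounds by opponents|
|d_reb|Total rebounds by opponents|
|d_asts|Assists by opponents|
|d_pf|Personal fouls by opponents|
|d_stl|Steals by opponents|
|d_to|Turnovers by opponents|
|d_blk|Blocks by opponents|
|d_pts|Points scored by opponents|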
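|tmORB|Team offensive rebounds|
|tmDRB|Team defensive rebounds|
|tmTRB|Team total rebounds|
|opptmORB|Opponent team offensive rebounds|
|opptmDRB|Opponent team defensive rebounds|
|opptmTRB|Opponent team total rebounds|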
|won|Matches Won|
|lost|Matches Lost|
|GP|Games Played|
|homeW|Matches Won at home stadium|
|homeL|Matches Lost at home stadium|
|awayW|Matches Won outside home stadium|
|awayL|Matches Lost outside home stadium|
|confW|Matches won within the conference|
|confL|Matches lost within the conference|
|min|Minutes Played|
|attend|Attendance|
|arena|Home Stadium|

### Table Sample

In [None]:
teams.head(15)

### Table Analysis

In [None]:
print(teams.describe())

## Table: Teams-Post

Represents the overall results of each team that qualified to the playoffs in each year.

|Attribute|Description|
|--|--|
|year|Year the data refers to|
|tmID|Team the data refers to|
|lgID|League the data refers to|
|W|Matches Won|
|L|Matches Lost|

### Table Sample

In [None]:
teams_post.head(15)

### Table Analysis

In [None]:
print(teams_post.describe())

## Step 2: Data Cleaning and Preprocessing

In this step the data is cleaned and organized, to ensure it's ready for further analysis and modeling.

### Table: Awards_Players

There are no missing values nor cases of data inconsistency in this table.

### Table: Coaches

There are no missing values nor cases of data inconsistency in this table.

### Table: Players

A few issues in this table call for cleaning:

- Some players lack a designated position in the column `pos`. To fix these missing values, every player without a specified position was given the position *N*.
- Some players also have no specified height or weight (stored as 0). To fix this, the mean value of each of these attributes is calculated and assigned to every player for whom the value is missing.
- The college names are also missing for a few players. This issue is simpler to solve, since some entries already hold a **none** value, so the blank values are assumed to mean that the player didn't go to college either.
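One thing to watch when imputing with the mean is that placeholder zeros, if included in the calculation, pull the mean down. The sketch below illustrates the difference on a toy `height` column with made-up values; it assumes, as described above, that 0 marks an unknown value.

```python
import pandas as pd

# Toy stand-in for the players table; 0 marks an unknown height
toy_players = pd.DataFrame({"height": [70.0, 0.0, 74.0, 0.0, 72.0]})

# Mean over all rows is dragged down by the placeholder zeros
naive_mean = toy_players["height"].mean()

# Mean over the known values only
known = toy_players.loc[toy_players["height"] > 0, "height"]
known_mean = known.mean()

# Replace the placeholders with the mean of the known values
toy_players["height"] = toy_players["height"].replace(0, known_mean)

print(naive_mean, known_mean)
print(toy_players)
```

In this toy example the two means differ considerably, which is why restricting the calculation to known values may be the safer choice.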

In [None]:
# Showcase the missing and unspecified values in the aforementioned columns
print(f'Missing values in `pos`: {players["pos"].isna().sum()}')
print(f'Unspecified values in `height`: {(players["height"] == 0).sum()}')
print(f'Unspecified values in `weight`: {(players["weight"] == 0).sum()}')
print(f'Missing values in `college`: {players["college"].isna().sum()}')
print(f'Missing values in `collegeOther`: {players["collegeOther"].isna().sum()}')

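# Replacing the values (plain assignment instead of chained `inplace=True`,
# which newer pandas versions may not apply to the original DataFrame)
players["pos"] = players["pos"].fillna("N")
# Compute each mean over the known (non-zero) values only, so the placeholders don't bias it
mean_height = int(players.loc[players["height"] > 0, "height"].mean())
mean_weight = int(players.loc[players["weight"] > 0, "weight"].mean())
players["height"] = players["height"].replace(0, mean_height)
players["weight"] = players["weight"].replace(0, mean_weight)
players["college"] = players["college"].fillna("none")
players["collegeOther"] = players["collegeOther"].fillna("none")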

### Table: Players_Teams

There are no missing values nor cases of data inconsistency in this table.

### Table: Teams

This table has a couple of issues worthy of pointing out:

- The `divID` column is missing values for every row of the table, therefore it's best to remove it altogether.
- The features describing each team's playoff results (`firstRound`, `semis`, `finals`) naturally lack values for teams that didn't make it to the playoffs that season and, for teams that did, for the rounds after they were eliminated. Ideally this missing data would be fixed, but since the goal is to predict playoff qualification, these features will most likely not be necessary and are therefore left untouched.

As for preprocessing:

- The `playoff` column is converted into a binary representation where the *Y* is 1 and *N* is 0. It'll make it easier to work with and to use with the models.

In [None]:
# Show values of `divID` column
print(f'No. of rows: {len(teams)}')
print(f'No. of missing values in `divID`: {teams["divID"].isna().sum()}')

# Drop `divID` column
teams.drop(columns=['divID'], inplace=True)

# Converting the `playoff` column to binary representation
teams["playoff"] = teams["playoff"].map({'Y': 1, 'N': 0})

### Table: Teams_Post

To ensure consistency between the `playoff` column in the *Teams* table and the presence of a team in this table, a set of (year, team) pairs is built from each table and the two sets are compared for equality.

In [None]:
# Creating a set of (year, tmID) pairs for teams in the post-season data
post_season_teams = set(teams_post[['year', 'tmID']].itertuples(index=False))

# Creating a similar set for teams marked as 'playoff' in the regular season data
regular_season_playoff_teams = set(teams[teams['playoff'] == 1][['year', 'tmID']].itertuples(index=False))

# Checking if the sets are equal
sets_are_equal = post_season_teams == regular_season_playoff_teams
print(f'Are the sets consistent? {"Yes" if sets_are_equal else "No" }.')
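If the two sets ever turn out to differ, set differences show which (year, team) pairs are responsible. The sketch below uses toy tables with one deliberate mismatch in each direction; all values are hypothetical.

```python
import pandas as pd

# Toy stand-ins (hypothetical values) with deliberate mismatches
toy_teams = pd.DataFrame({
    "year": [1, 1, 1],
    "tmID": ["AAA", "BBB", "CCC"],
    "playoff": [1, 1, 0],
})
toy_teams_post = pd.DataFrame({"year": [1, 1], "tmID": ["AAA", "CCC"]})

# name=None makes itertuples yield plain tuples
flagged = set(
    toy_teams.loc[toy_teams["playoff"] == 1, ["year", "tmID"]]
    .itertuples(index=False, name=None)
)
in_post = set(toy_teams_post[["year", "tmID"]].itertuples(index=False, name=None))

# Flagged as a playoff team but absent from the post-season table
missing_from_post = flagged - in_post
# Present in the post-season table but not flagged as a playoff team
missing_flag = in_post - flagged

print(missing_from_post)
print(missing_flag)
```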

## Step 3: Exploratory Data Analysis

In this step, descriptive statistics and visualizations are generated to better understand the data and look for patterns in it.

- The bar plot below displays the 20 most awarded players and the number of awards they received

In [None]:
import matplotlib.pyplot as plt

# TOP 20 PLAYERS BY NUMBER OF AWARDS

# Group data by player and count the number of awards
award_counts = awards_players['award'].groupby(awards_players['playerID']).count()

# Sort the players by the number of awards received in descending order
award_counts = award_counts.sort_values(ascending=False)

# Create a horizontal bar chart
award_counts[:20].plot(kind='barh', figsize=(12, 8))  # Display the top 20 players
plt.title('Top 20 Players by Number of Awards')
plt.xlabel('Number of Awards')
plt.ylabel('Player ID')
plt.gca().invert_yaxis()  # Invert the y-axis for the highest award count at the top
plt.show()

- The next plot shows how many times each team has made it to the playoffs. From the graph, it's possible to see that a good number of teams consistently make it to the playoffs across seasons, most notably LAS, which did so in all seasons except one. However, some teams have never made it, and a few made it just once.

In [None]:
# Plotting teams and the amount of times they've made it to the playoffs
plt.figure(figsize=(8, 6))
teams.groupby("tmID")["playoff"].sum().plot(kind='bar', color='skyblue')
plt.title('Distribution of Teams: No. of Playoffs Reached')
plt.xlabel('Team ID')
plt.ylabel('No. of Playoffs Reached')
plt.show()

- Next, let's explore the descriptive statistics of team performance, focusing on a few key metrics that are more likely to influence playoff qualification:
   - **Wins** (total, home, away, and within the conference)
   - **Losses** (total, home, away, and within the conference)
   - **Rank**
   - **Attendance**: This metric can reflect a team's popularity and/or fan engagement  
<br>  
  
- A summary of the results:
   - **Wins**: Teams that made it to the playoffs won, on average, around 20 games per season, whereas teams that didn't make it won only around 12. Qualifying teams consistently won more games than non-qualifying teams, whether at home, away or within the conference. As expected, teams have, on average, more home wins than away wins.
   - **Losses**: Playoff teams average around 13 losses per season, with non-playoff teams losing about 21 games.
   - **Rank**: Since playoff qualification is based on league rank, the playoff qualifying teams must be, and the metrics indeed show they are, ranked higher than non-qualifying ones. 
   - **Attendance**: Attendance for playoff teams is around 14% higher than for non-playoff teams.
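A relative difference such as the attendance figure above can be derived directly from the grouped means. The sketch below shows the calculation on toy data; the attendance numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import pandas as pd

# Toy stand-in for the teams table (hypothetical attendance figures)
toy_teams = pd.DataFrame({
    "playoff": [1, 1, 0, 0],
    "attend":  [120000, 108000, 100000, 100000],
})

# Mean attendance per group: 0 = missed the playoffs, 1 = made the playoffs
mean_attend = toy_teams.groupby("playoff")["attend"].mean()

# Relative difference of playoff teams over non-playoff teams, in percent
pct_higher = (mean_attend.loc[1] / mean_attend.loc[0] - 1) * 100
print(f"{pct_higher:.1f}% higher")
```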

In [None]:
# Selecting key performance metrics
key_metrics = ['won', 'homeW', 'awayW', 'confW', 'lost', 'homeL', 'awayL', 'confL', 'rank', 'attend']

# Grouping by 'playoff' and computing the mean for each metric
grouped_metrics = teams.groupby('playoff')[key_metrics].mean()

grouped_metrics

- To better visualize some of these metrics, these plots have been generated, comparing average wins and average losses between playoff and non-playoff teams.

In [None]:
# Plotting average wins and losses for playoff vs. non-playoff teams
fig, ax = plt.subplots(1, 2, figsize=(16, 6))

# Wins
grouped_metrics['won'].plot(kind='bar', ax=ax[0], color=['skyblue', 'salmon'])
ax[0].set_title('Average Wins: Playoffs vs. No Playoffs')
ax[0].set_xlabel('Made it to Playoffs')
ax[0].set_ylabel('Average Wins')
ax[0].set_xticklabels(['No', 'Yes'], rotation=0)

# Losses
grouped_metrics['lost'].plot(kind='bar', ax=ax[1], color=['skyblue', 'salmon'])
ax[1].set_title('Average Losses: Playoffs vs. No Playoffs')
ax[1].set_xlabel('Made it to Playoffs')
ax[1].set_ylabel('Average Losses')
ax[1].set_xticklabels(['No', 'Yes'], rotation=0)

plt.tight_layout()
plt.show()

- The following histograms show how many coaches achieved a given number of wins and losses with the teams they coached over the years in the WNBA. From them, we can conclude that both wins and losses are fairly well distributed among coaches. There are many coaches with around 16 losses and a considerable number with around 18 wins, while the rest are scattered between 0 and 30 wins and losses.

In [None]:
# COACHES HISTOGRAM

import seaborn as sns

# Select 'won' and 'lost' columns
won = coaches['won']
lost = coaches['lost']

# Create subplots to display histograms side by side
plt.figure(figsize=(12, 6))

# Create a histogram for the 'won' column
plt.subplot(1, 2, 1)
plt.hist(won, bins=20, color='green', alpha=0.7)
plt.title('Distribution of Wins with Coaches')
plt.xlabel('Number of Wins')
plt.ylabel('Frequency')

# Create a histogram for the 'lost' column
plt.subplot(1, 2, 2)
plt.hist(lost, bins=20, color='red', alpha=0.7)
plt.title('Distribution of Losses with Coaches')
plt.xlabel('Number of Losses')
plt.ylabel('Frequency')

# Display the subplots
plt.tight_layout()
plt.show()

- According to the box plots below, the coaches are fairly spread out between 0 and 25 wins and between 0 and 30 losses. However, 50% of the coaches have between 10 and 20 wins and, in terms of losses, 50% of the registered coaches have approximately between 12 and 18. This tells us that, despite there being some exceptionally good and some exceptionally bad coaches, the bulk of them are fairly well concentrated around the median value of both wins and losses.
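The ranges read off a box plot can also be computed directly. The sketch below uses a toy series with hypothetical values in place of `coaches['won']` and derives the quartiles, the interquartile range and the 1.5 × IQR fences that Matplotlib uses by default to decide where whiskers end and outliers begin.

```python
import pandas as pd

# Toy stand-in for coaches['won'] (hypothetical values)
toy_won = pd.Series([4, 10, 12, 16, 18, 20, 25])

# The box spans q1 to q3; the line inside the box is the median
q1, median, q3 = toy_won.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond these fences are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = toy_won[(toy_won < lower_fence) | (toy_won > upper_fence)]

print(q1, median, q3, iqr)
print(outliers.tolist())
```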

In [None]:
# COACHES WON-LOST BOX PLOT
# Create a box plot for 'won'
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.boxplot(won, vert=False)
plt.title('Box Plot for Wins with Coaches')
plt.xlabel('Number of Wins')

# Create a box plot for 'lost'
plt.subplot(1, 2, 2)
plt.boxplot(lost, vert=False)
plt.title('Box Plot for Losses with Coaches')
plt.xlabel('Number of Losses')

# Display the box plots
plt.tight_layout()
plt.show()

- The correlation matrix below helps us understand how the coach statistics correlate with each other. The conclusions we drew from this plot were:
    - The year is weakly positively correlated with every other coach attribute, meaning that wins, losses, playoff wins and playoff losses change only minimally, but positively, over the years
    - The number of wins has a weak negative correlation with the number of losses but a strong positive correlation with the number of wins and losses in the playoffs
    - The number of losses has a weak negative correlation with the number of wins and a stronger, but still weak, negative correlation with the number of wins and losses in the playoffs
    - Playoff wins have a strong positive correlation with regular-season wins and with playoff losses
    - Playoff losses have a strong positive correlation with regular-season wins and with playoff wins
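If every row covered a full season with the same number of games, wins and losses would be perfectly negatively correlated. One possible explanation for the weak value seen here (an assumption, not something verified on the dataset) is that some rows cover only part of a season, for example when a coach is replaced mid-season. The sketch below illustrates the mechanism with hypothetical records and a hypothetical 34-game season.

```python
import pandas as pd

# Hypothetical coach records: everyone coaches a full 34-game season
full = pd.DataFrame({"won": [10, 17, 20, 25]})
full["lost"] = 34 - full["won"]

# Same records plus one short stint of 10 games
mixed = pd.concat(
    [full, pd.DataFrame({"won": [5], "lost": [5]})],
    ignore_index=True,
)

# Pearson correlation between wins and losses in each scenario
corr_full = full["won"].corr(full["lost"])
corr_mixed = mixed["won"].corr(mixed["lost"])
print(round(corr_full, 3), round(corr_mixed, 3))
```

A single short stint is enough to move the coefficient from a perfect negative correlation to a value close to zero in this toy example.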

In [None]:
import seaborn as sns

# COACHES HEATMAP
# Select the columns for the correlation analysis
selected_columns = ['year', 'won', 'lost', 'post_wins', 'post_losses']

# Create a subset of the dataset with selected columns
coaches_subset = coaches[selected_columns]

# Calculate the correlation matrix
correlation_matrix = coaches_subset.corr()

# Create a heatmap to visualize the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=.5)
plt.title('Correlation Matrix of Coach Variables')
plt.show()

- From this plot, we can conclude that Guard is the most played position among the players and Center-Forward the least played one in the WNBA. The plot also shows how players are distributed across the other positions. It's worth noting that a player doesn't necessarily always play the same position.

In [None]:
# Create a bar chart for the distribution of player positions
position_counts = players['pos'].value_counts()
position_counts.plot(kind='bar', figsize=(10, 6))
plt.title('Distribution of Player Positions')
plt.xlabel('Player Position')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

- This plot shows that 0.4% of the players in our dataset have already died. These players have to be considered in the years when they were alive and playing, but cannot be considered part of any team now.

In [None]:
# Count of players still alive and those who have passed away
players['deathDate'] = pd.to_datetime(players['deathDate'], errors='coerce')
alive_players = players['deathDate'].isna().sum()
passed_away_players = players['deathDate'].notna().sum()

plt.figure(figsize=(6, 6))
plt.pie([alive_players, passed_away_players], labels=['Alive', 'Passed Away'], autopct='%1.1f%%', startangle=140)
plt.title('Players Status')
plt.show()
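The alive/deceased split relies on the placeholder `"0000-00-00"` not being a valid calendar date, so that `errors='coerce'` turns it into `NaT`. The sketch below shows this on a toy series with hypothetical values; the explicit `format` argument is an addition made here for clarity.

```python
import pandas as pd

# Toy stand-in for players['deathDate']; "0000-00-00" marks a living player
toy_death = pd.Series(["0000-00-00", "2005-03-14", "0000-00-00", "0000-00-00"])

# Invalid dates become NaT instead of raising an error
parsed = pd.to_datetime(toy_death, format="%Y-%m-%d", errors="coerce")

alive = int(parsed.isna().sum())
deceased = int(parsed.notna().sum())
share_deceased = deceased / len(parsed) * 100

print(alive, deceased, share_deceased)
```

Note that any other unparseable value would also end up as `NaT` and therefore be counted as alive.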

- To analyse how frequent players with high performance statistics are, we built histograms showing the number of players at each level of several performance statistics. Here's what we concluded:
    - Players are fairly well distributed in terms of minutes played, with, unsurprisingly, more players towards the lower end
    - Some performance statistics are significantly lower for the majority of players (namely Free Throws and Three Point Goals), while others are better distributed across high and low values (like Points Scored, Rebounds, Assists, Steals and Blocks)

In [None]:
# Merge the 'Players' and 'Players-Teams' tables based on 'bioID'
merged_data = players_teams.merge(players[['bioID', 'birthDate']], left_on='playerID', right_on='bioID', how='inner')

# Select the performance statistics columns you want to create histograms for
performance_columns = ['minutes', 'points', 'oRebounds', 'dRebounds', 'assists', 'steals', 'blocks', 'turnovers', 'PF',
                      'fgMade', 'fgAttempted', 'ftMade', 'ftAttempted', 'threeMade', 'threeAttempted']

# Define the number of columns and rows for subplots
num_columns = 4
num_rows = (len(performance_columns) + num_columns - 1) // num_columns  # Calculate the number of rows

# Create histograms for each selected performance statistic
plt.figure(figsize=(15, 5 * num_rows))

for i, column in enumerate(performance_columns, 1):
    plt.subplot(num_rows, num_columns, i)
    plt.hist(merged_data[column], bins=20, color='blue', alpha=0.7)
    plt.title(f'Distribution of {column}')
    plt.xlabel(column)
    plt.ylabel('Count')

plt.tight_layout()
plt.show()

- The plots below show the top 20 players in each statistic side by side, to give an idea of the top values in the league.

In [None]:
# Merge the 'Players' and 'Players-Teams' tables based on 'bioID'
merged_data = players_teams.merge(players[['bioID', 'birthDate', 'pos']], left_on='playerID', right_on='bioID', how='inner')

# List of performance statistics you want to consider
performance_statistics = ['points', 'rebounds', 'assists']

# Create subplots for each performance statistic
fig, axs = plt.subplots(len(performance_statistics), 1, figsize=(10, 15))

for i, stat in enumerate(performance_statistics):
    # Group data by player and calculate the total statistic for each player
    player_stats = merged_data.groupby('bioID')[stat].sum()

    # Sort players by the performance statistic in descending order and select the top 20
    top_players = player_stats.sort_values(ascending=False).head(20)

    # Create a bar chart for the top 20 players in the current statistic
    axs[i].barh(top_players.index, top_players.values, color='blue', alpha=0.7)
    axs[i].set_title(f'Top 20 Players in {stat}')
    axs[i].set_xlabel(stat)
    axs[i].set_ylabel('Player')
    axs[i].invert_yaxis()

plt.tight_layout()
plt.show()

- The plots below show how points scored, assists and rebounds are distributed across teams over the 10 seasons we have data on.

In [None]:
# Merge the 'Players' and 'Players-Teams' tables based on 'bioID'
merged_data = players_teams.merge(players[['bioID', 'birthDate']], left_on='playerID', right_on='bioID', how='inner')

# List of performance statistics to plot
performance_statistics = ['points', 'rebounds', 'assists']

# Create one bar chart per performance statistic, with one bar per team
for stat in performance_statistics:
    # Group the data by team and compute the mean of the statistic over that team's player rows
    team_stats = merged_data.groupby('tmID')[stat].mean()

    # Create a bar chart of the per-team means
    plt.figure(figsize=(12, 6))
    team_stats.plot(kind='bar', color='blue', alpha=0.7)
    plt.title(f'{stat} Distribution by Team')
    plt.xlabel('Team')
    plt.ylabel(stat)
    plt.xticks(rotation=45)
    plt.show()

- From this correlation matrix between player performance statistics and the wins and losses of their teams, we can draw several conclusions:
    - Minutes Played is (as expected) strongly positively correlated with all other player performance statistics, despite being weakly negatively correlated with the number of wins and losses
    - Assists and Offensive Rebounds are only weakly positively correlated with each other. The same happens between Assists and Blocks.
    - The remaining player performance statistics are strongly positively correlated with each other, most notably Points Scored, which is strongly positively correlated with all of them except Blocks.
    - The number of wins and losses of a team is not strongly correlated with any of the statistics individually, meaning that none of them alone would be enough to explain a team's wins or losses.
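Because each player row repeats the result of the whole team, individual statistics can only explain a small share of it. A possible complement, sketched below on toy data with hypothetical values, is to sum the player statistics per team and season before relating them to team results.

```python
import pandas as pd

# Toy stand-ins (hypothetical values)
toy_players_teams = pd.DataFrame({
    "year":    [1, 1, 1, 1],
    "tmID":    ["AAA", "AAA", "BBB", "BBB"],
    "points":  [400, 300, 250, 200],
    "assists": [90, 60, 50, 40],
})
toy_teams = pd.DataFrame({"year": [1, 1], "tmID": ["AAA", "BBB"], "won": [22, 11]})

# Sum player stats into one row per team and season
team_totals = (
    toy_players_teams
    .groupby(["year", "tmID"], as_index=False)[["points", "assists"]]
    .sum()
)

# Attach the team's season result
team_level = team_totals.merge(toy_teams, on=["year", "tmID"], how="inner")
print(team_level)
```

On the real data, `team_level[["points", "assists", "won"]].corr()` would then give team-level correlations.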

In [None]:

# Merge the 'Players' and 'Players-Teams' tables based on 'bioID'
merged_data = players_teams.merge(players[['bioID', 'birthDate', 'pos']], left_on='playerID', right_on='bioID', how='inner')

# Keep only rows with stint == 1 (note: `stint` identifies a spell with a team, not the phase of the season)
before_playoffs_data = merged_data[merged_data['stint'] == 1]

# Merge 'Coaches' data to associate players with team results before playoffs
combined_data = before_playoffs_data.merge(coaches[['year', 'tmID', 'won', 'lost']], on=['year', 'tmID'], how='inner')

# Select relevant player performance statistics and team results columns
performance_statistics = ['minutes','points', 'oRebounds', 'dRebounds', 'assists', 'steals', 'blocks', 'turnovers', 'PF',
                         'fgMade', 'fgAttempted', 'ftMade', 'ftAttempted', 'threeMade', 'threeAttempted']
team_results = ['won', 'lost']

# Calculate the correlation matrix
correlation_matrix = combined_data[performance_statistics + team_results].corr()

# Create a heatmap to visualize the correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=.5)
plt.title('Correlation Matrix of Player Performance and Team Results (Before Playoffs)')
plt.show()

- This plot shows each team's number of wins over the seasons and lets us compare teams with each other. Analysing it, we can conclude that some teams that were on top a few seasons ago are now declining in wins, while some teams that used to rank lower are now doing better. This shows that, even if a team won a lot of matches last year, its performance may be completely different next season as players and coaches change and their form varies.

In [None]:

# Sort the data by the year (or relevant time variable).
teams_sorted = teams.sort_values(by='year')

# Create a line chart to visualize the wins of all teams over time.
plt.figure(figsize=(12, 6))

# Define a list of distinct line colors for the teams.
colors = ['b', 'g', 'r', 'c', 'm', 'y', 'k', 'orange', 'purple', 'pink',
          'lime', 'brown', 'gray', 'olive', 'cyan', 'teal', 'indigo', 'gold', 'darkred']

# Iterate over each team and set the line color from the 'colors' list.
for i, (team, data) in enumerate(teams_sorted.groupby('franchID')):
    color = colors[i % len(colors)]
    plt.plot(data['year'], data['won'], label=team, marker='o', color=color)

# Customize the chart labels and title.
plt.xlabel('Year')
plt.ylabel('Wins')
plt.title('Wins of All Teams Over Time')
plt.xticks(rotation=45)  # Rotate x-axis labels for readability

# Add a legend to distinguish teams and place it in the lower right corner.
plt.legend(loc='lower right')

# Show the line chart.
plt.grid(True)
plt.tight_layout()
plt.show()
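Since the goal is to predict a season from earlier ones, season-to-season changes like those visible in the line chart can be expressed as lagged columns. The sketch below is one possible approach on toy data; the column names `won_prev` and `won_change` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the teams table (hypothetical values)
toy_teams = pd.DataFrame({
    "year": [1, 2, 3, 1, 2, 3],
    "franchID": ["AAA"] * 3 + ["BBB"] * 3,
    "won": [20, 15, 9, 8, 14, 21],
})

# Order seasons within each franchise, then shift wins by one season
toy_teams = toy_teams.sort_values(["franchID", "year"])
toy_teams["won_prev"] = toy_teams.groupby("franchID")["won"].shift(1)

# Change in wins compared with the previous season
toy_teams["won_change"] = toy_teams["won"] - toy_teams["won_prev"]
print(toy_teams)
```

The first season of each franchise has no previous value, so its lagged columns are `NaN`.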

- This plot shows the percentage of seasons in which each team qualified for the playoffs. Judging by it, and assuming the teams haven't changed much, some teams are more likely than others to qualify next season.

In [None]:

# Group the data by team and calculate the playoff qualification rate.
team_playoff_stats = teams.groupby('franchID').agg({'playoff': 'sum', 'year': 'count'})

# Calculate the playoff qualification rate as the number of playoff appearances divided by the total seasons.
team_playoff_stats['qualification_rate'] = (team_playoff_stats['playoff'] / team_playoff_stats['year']) * 100

# Sort the data by qualification rate in descending order.
team_playoff_stats = team_playoff_stats.sort_values(by='qualification_rate', ascending=False)

# Create a bar chart to display the playoff qualification rate for each team.
plt.figure(figsize=(12, 6))
plt.bar(team_playoff_stats.index, team_playoff_stats['qualification_rate'], color='skyblue')
plt.xlabel('Team')
plt.ylabel('Playoff Qualification Rate (%)')
plt.title('Playoff Qualification Rate for Each Team')
plt.xticks(rotation=90)  # Rotate x-axis labels for readability

# Show the bar chart.
plt.grid(axis='y')
plt.tight_layout()
plt.show()

- From this bar chart, we can conclude that Western Conference teams have won slightly more matches than Eastern Conference teams, but the two seem fairly balanced.

In [None]:
# Group the data by conference and calculate the average number of wins and losses.
conference_stats = teams.groupby('confID').agg({'won': 'mean', 'lost': 'mean'})

# Create a bar chart to compare team performances by conference.
# (DataFrame.plot creates its own figure, so the size is passed here instead of via plt.figure)
conference_stats.plot(kind='bar', figsize=(10, 6), color=['skyblue', 'lightcoral'], alpha=0.7, stacked=False)
plt.xlabel('Conference')
plt.ylabel('Average Wins and Losses')
plt.title('Team Performance by Conference')
plt.xticks(rotation=0)  # Remove rotation for horizontal labels

# Show the bar chart.
plt.grid(axis='y')
plt.tight_layout()
plt.show()

- This plot shows how team performance varies with attendance in each year. From it we can conclude:
    - The teams with the higher attendance values are also the ones with the best performances, but it's also in that range that we find the teams with the worst performances
    - Teams with very low or very high attendance have a performance closer to average

In [None]:

# Select the columns for attendance and performance
attendance_column = 'attend'
performance_column = 'won'  # You can choose 'won', 'lost', or other performance metric

# Create the scatter plot
plt.figure(figsize=(10, 6))
scatter = plt.scatter(teams[attendance_column], teams[performance_column], c='b', alpha=0.5, edgecolors='k')

# Add labels for individual data points (teams) with their respective years
# (iterating over rows avoids relying on the DataFrame having a default 0..n-1 index)
for _, row in teams.iterrows():
    plt.annotate(f'{row["tmID"]} ({row["year"]})', (row[attendance_column], row[performance_column]))

plt.xlabel('Attendance')
plt.ylabel('Performance (Wins)')
plt.title('Team Attendance vs. Performance Scatter Plot')

# Show the scatter plot
plt.grid(True)
plt.tight_layout()
plt.show()

- This correlation matrix compares team performance statistics with attendance values. From this plot we can conclude that:
    - Games Played and Attendance are, unsurprisingly, strongly positively correlated, and so are total wins with each type of win count (the same holds for losses)
    - Attendance is not positively correlated with any of the team performance statistics individually, meaning that attendance alone isn't enough to determine a team's performance during a season

In [None]:
# Select columns for attendance and performance statistics (excluding opponent statistics)
selected_columns = ['attend', 'won', 'lost', 'GP', 'homeW', 'homeL', 'awayW', 'awayL',
                    'confW', 'confL', 'min', 'o_3pm', 'd_3pm', 'o_pts', 'd_pts']

# Create a new DataFrame with only the selected columns
team_performance = teams[selected_columns]

# Calculate the correlation matrix
correlation_matrix = team_performance.corr()

# Create a heatmap of the correlation matrix using Seaborn
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=.5)
plt.title('Correlation Matrix: Attendance vs. Performance Statistics')
plt.show()

- Based on the box plot below we can conclude that:
    - Home wins are more frequent than away wins
    - Home wins are more spread out than away wins, even within the middle 50% of the data (the interquartile range)

In [None]:

# Select columns for home wins and away wins
home_wins = teams['homeW']
away_wins = teams['awayW']

# Create a new DataFrame with selected columns
wins_data = pd.DataFrame({'Home Wins': home_wins, 'Away Wins': away_wins})

# Create a box plot to compare home wins and away wins
plt.figure(figsize=(6, 4))

plt.boxplot(wins_data.values, labels=wins_data.columns, vert=False)
plt.title('Home Wins vs. Away Wins')
plt.xlabel('Wins')
plt.ylabel('Game Locations')

plt.tight_layout()
plt.show()
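
The spread comparison read off the box plot can also be checked numerically. A minimal sketch, using made-up win counts rather than the real `teams` table, that computes the quartiles and the interquartile range (the middle 50% of the data) for each column:

```python
import pandas as pd

# Hypothetical win counts; the real values would come from teams['homeW'] and teams['awayW']
toy_wins = pd.DataFrame({
    'Home Wins': [5, 8, 10, 12, 14, 9, 11],
    'Away Wins': [4, 5, 6, 7, 8, 6, 5],
})

# Quartiles per column, then the interquartile range (Q3 - Q1)
quartiles = toy_wins.quantile([0.25, 0.5, 0.75])
iqr = quartiles.loc[0.75] - quartiles.loc[0.25]

print(quartiles)
print(iqr)
```

A larger interquartile range for one column means its box in the plot is wider.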

## Step 4: Feature Engineering

In this step, we will decide which features to use in the model and how to put them together.


- Awards are meant to recognise the value of a player or coach: remarkable capabilities that stand out from the rest, and the potential they bring to their team's success. 

  Therefore, it makes sense to assume that a team whose members have earned more awards is more likely to succeed and make the playoffs. There are also several kinds of awards, and some matter more than others for a team's overall success (a sportsmanship award, for instance, carries less weight than an MVP award).  
  
  With this in mind, each award is assigned an arbitrary numerical value: the higher the value, the more valuable the award.

In [None]:
# Show all the different awards present in the *awards_players* table
awards_players['award'].unique()

In [None]:
# Assign numerical values to each award
award_values = {
    'All-Star Game Most Valuable Player': 3.5,
    'Coach of the Year': 3,
    'Defensive Player of the Year': 2,
    'Kim Perrot Sportsmanship': 1,
    'Kim Perrot Sportsmanship Award': 1,
    'Most Improved Player': 2,
    'Most Valuable Player': 4,
    'Rookie of the Year': 2.5,
    'Sixth Woman of the Year': 1.5,
    'WNBA Finals Most Valuable Player': 3,
    'WNBA All-Decade Team': 5,
    'WNBA All Decade Team Honorable Mention': 4.5
}

# Create new column with the numerical value for each award
awards_players['award_value'] = [award_values[award] for award in awards_players['award']]

awards_players
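
The list comprehension above raises a `KeyError` as soon as the table contains an award name that is missing from the dictionary. A sketch of an alternative, assuming the same dictionary shape, is to use `Series.map` and then list any unmapped names explicitly. The tiny table and the 'Some New Award' entry are made up for illustration:

```python
import pandas as pd

# Hypothetical subset of the award dictionary and of the awards table
toy_award_values = {'Most Valuable Player': 4, 'Rookie of the Year': 2.5}
toy_awards = pd.DataFrame({
    'playerID': ['p1', 'p2', 'p3'],
    'award': ['Most Valuable Player', 'Rookie of the Year', 'Some New Award'],
})

# map() leaves NaN where the award name has no entry in the dictionary
toy_awards['award_value'] = toy_awards['award'].map(toy_award_values)

# Collect the award names that still need a value assigned
unmapped = toy_awards.loc[toy_awards['award_value'].isna(), 'award'].unique()
print(unmapped)
```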

- Many attributes across the tables are redundant. For example, the stats in the *teams* table are just an aggregation of the individual stats of that team's players for that year in the *players_teams* table. Since the data is already mostly summarized in the *teams* table, it is the main table to look at when selecting the features for the models.  

  Some tables also contain data about the postseason, which is outside the scope of determining whether a team makes it to the postseason in the first place. Including this kind of information in our models would lead to data leakage, so these tables are left out.  
  
- Based on the *teams* table and the heatmap of correlations between its features and the target, the following features were selected to train the models:

|       Name        | Feature  |
| ----------------- | :------: |
| Year              | `year`   |
| Team              | `tmID`   |
| Wins              | `won`    |
| Home wins         | `homeW`  |
| Away wins         | `awayW`  |
| Conf wins         | `confW`  |
| Losses            | `lost`   |
| Home losses       | `homeL`  |
| Away losses       | `awayL`  |
| Conf losses       | `confL`  |
| Rank              | `rank`   |
| Attendance        | `attend` |
| Team's points     | `d_pts`  |
| Opponent's points | `o_pts`  |
   
<br>

- To cut down on the number of features, some of them are combined into new features, for example:
   - Wins and losses are combined into a win/loss ratio
   - Points conceded are subtracted from points scored, giving a point difference

In [None]:
# Filter the *teams* table by the selected features
teams_selected_features = ['year', 'tmID', 'won', 'homeW', 'awayW', 'confW', 'lost', 'homeL', 'awayL', 'confL',
                           'rank', 'attend', 'd_pts', 'o_pts']
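# .copy() makes train_df an independent DataFrame, so the column assignments
# below do not trigger pandas' SettingWithCopyWarning
train_df = teams[teams_selected_features].copy()

# Create the new features, combining existing ones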
train_df['win_ratio'] = train_df['won'] / train_df['lost']
train_df['home_win_ratio'] = train_df['homeW'] / train_df['homeL']
train_df['away_win_ratio'] = train_df['awayW'] / train_df['awayL']
train_df['conf_win_ratio'] = train_df['confW'] / train_df['confL']
train_df['point_dif'] = train_df['d_pts'] - train_df['o_pts']

# Drop the now redundant features
train_df.drop(columns=['won', 'homeW', 'awayW', 'confW', 'lost', 'homeL', 'awayL', 'confL', 'd_pts', 'o_pts'], inplace=True)

train_df
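
A plain win/loss ratio is undefined when the loss count is zero (pandas returns `inf`, or `NaN` for 0/0). One possible alternative, not used in this notebook, is the win percentage `won / (won + lost)`, which stays between 0 and 1. A sketch on invented numbers:

```python
import pandas as pd

# Hypothetical records; the third team has no losses
toy = pd.DataFrame({'won': [20, 10, 16], 'lost': [12, 22, 0]})

# Ratio as computed above: division by zero yields inf
toy['win_ratio'] = toy['won'] / toy['lost']

# Win percentage: bounded in [0, 1] whenever at least one game was played
toy['win_pct'] = toy['won'] / (toy['won'] + toy['lost'])

print(toy)
```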

- To these, the average height and weight of each team's players in each year are added.

In [None]:
# Merge player data with their respective team and year in the *players_teams* table
player_data_team_merged_df = pd.merge(players, players_teams, left_on='bioID', right_on='playerID')
player_data_team_merged_df = player_data_team_merged_df[['bioID', 'tmID', 'year', 'height', 'weight']]

# Calculate the average height and weight of players of every team for every year
avg_stats = player_data_team_merged_df.groupby(['tmID', 'year'])[['height', 'weight']].mean().round(2)
avg_stats.rename(columns={"height": "avg_player_height", "weight": "avg_player_weight"}, inplace=True)

# Combine the average player stats with the training data by team and year
train_df = pd.merge(train_df, avg_stats, on=["tmID", "year"])
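
Because `pd.merge` defaults to an inner join, any team-season with no matching player rows silently disappears from the result. A sketch of how this could be checked, using made-up frames and the `indicator` option of `pd.merge`:

```python
import pandas as pd

# Hypothetical team-seasons and per-team averages (team CCC has no player data)
toy_train = pd.DataFrame({'tmID': ['AAA', 'BBB', 'CCC'], 'year': [1, 1, 1]})
toy_avg = pd.DataFrame({
    'tmID': ['AAA', 'BBB'],
    'year': [1, 1],
    'avg_player_height': [72.5, 71.8],
})

# A left join keeps every team-season; _merge records where each row came from
checked = pd.merge(toy_train, toy_avg, how='left', on=['tmID', 'year'], indicator=True)

# Rows that an inner join would have dropped
missing = checked[checked['_merge'] == 'left_only']
print(missing[['tmID', 'year']])
```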

- Finally, the values of the player and coach awards are merged in, also per team and per year.

In [None]:
# Merge player award data with their respective team and year in the *players_teams* table
player_award_team_merged_df = pd.merge(awards_players, players_teams, on=['playerID', 'year', 'lgID'])
player_award_team_merged_df = player_award_team_merged_df[['playerID', 'tmID', 'year', 'award', 'award_value']]

# Calculate the sum of players awards' values of every team for every year
player_award_values_sum = player_award_team_merged_df.groupby(['tmID', 'year'])['award_value'].sum()
player_award_values_sum.name = 'sum_award_values'

# Combine the sum player award values with the training data by team and year
train_df = pd.merge(train_df, player_award_values_sum, how="left", on=["tmID", "year"])

# Merge coach award data with their respective team and year in *coaches* table
coach_award_team_merged_df = pd.merge(awards_players, coaches, left_on=['playerID', 'year', 'lgID'], right_on=['coachID', 'year', 'lgID'])
coach_award_team_merged_df.rename(columns={'award_value': 'coach_award_value'}, inplace=True)
coach_award_team_merged_df = coach_award_team_merged_df[['tmID', 'year', 'coach_award_value']]

# Combine the coach award values with the training data by team and year
train_df = pd.merge(train_df, coach_award_team_merged_df, how="left", on=["tmID", "year"])

# Replace NaN values from the resulting merges with 0
train_df.fillna(0, inplace=True)

# Sum the players awards' value with the coach's
train_df['sum_award_values'] = train_df['sum_award_values'] + train_df['coach_award_value']
train_df.drop(columns=['coach_award_value'], inplace=True)
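
A left merge repeats a row of the left table once for every matching row on the right, so a team-season that matches two award rows would appear twice in the training data. A sketch of a guard against this, on made-up data, is to aggregate the right-hand table to one row per `tmID` and `year` before merging:

```python
import pandas as pd

toy_train = pd.DataFrame({'tmID': ['AAA', 'BBB'], 'year': [1, 1]})

# Hypothetical award rows: team AAA matches twice in the same year
toy_coach_awards = pd.DataFrame({
    'tmID': ['AAA', 'AAA'],
    'year': [1, 1],
    'coach_award_value': [3, 3],
})

# Merging directly duplicates the AAA row
naive = pd.merge(toy_train, toy_coach_awards, how='left', on=['tmID', 'year'])

# Summing per team and year first keeps one row per team-season
per_team = toy_coach_awards.groupby(['tmID', 'year'], as_index=False)['coach_award_value'].sum()
safe = pd.merge(toy_train, per_team, how='left', on=['tmID', 'year']).fillna(0)

print(len(naive), len(safe))
```

Comparing the row count before and after each merge is a quick way to spot this kind of duplication on the real tables.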

The final training dataset looks as shown below:

In [None]:
# Display the train dataset
train_df
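
The stated objective is to predict qualification for the next season from earlier seasons, whereas each row above describes a single season. One way this could be set up, shown here as a sketch on invented data with a hypothetical column name `playoff_next`, is to shift the target back by one season within each team, so that year N features are paired with the year N+1 outcome. It assumes each team has one row per consecutive season:

```python
import pandas as pd

# Hypothetical team-seasons; 1 means the team made the playoffs that year
toy = pd.DataFrame({
    'tmID': ['AAA', 'AAA', 'AAA', 'BBB', 'BBB', 'BBB'],
    'year': [1, 2, 3, 1, 2, 3],
    'win_ratio': [1.2, 0.8, 1.5, 0.6, 1.1, 0.9],
    'playoff': [1, 0, 1, 0, 1, 0],
})

# Sort so that shifting moves along time, then take next year's label per team
toy = toy.sort_values(['tmID', 'year'])
toy['playoff_next'] = toy.groupby('tmID')['playoff'].shift(-1)

# The last season of each team has no following season to learn from
supervised = toy.dropna(subset=['playoff_next'])
print(supervised)
```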

## Step 5: Model Evaluation

In this step, the data is split into training and testing sets, and then a series of models are trained and evaluated.

We start by using home and away wins and losses, team rank, attendance and year to predict playoff qualification. The plan is to train on years 1 to 9 and predict year 10; as a first attempt, the cell below uses a random 80/20 split of all seasons with a random forest.

In [None]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

# Features: home/away wins and losses, rank, attendance and year
selected_features = ['homeW', 'homeL', 'awayW', 'awayL', 'rank', 'attend', 'year']
X = teams[selected_features]

# Target: whether the team qualified for the playoffs
y = teams['playoff']

# Random 80/20 split across all seasons
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

predictions = model.predict(X_test)

# Accuracy: fraction of test rows whose class was predicted correctly
accuracy = accuracy_score(y_test, predictions)
print(f'Accuracy: {accuracy}')

cm = confusion_matrix(y_test, predictions)

plt.figure(figsize=(10,7))
sns.heatmap(cm, annot=True, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('Truth')
plt.show()
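
The plan stated above is to train on years 1 to 9 and predict year 10, while `train_test_split` draws the test rows at random from all seasons. A sketch of a split by season, on synthetic data generated only for illustration (the column names mirror the `teams` table, the values and the labelling rule do not):

```python
import pandas as pd

# Synthetic stand-in for the teams table: 3 teams over 10 seasons
rows = []
for year in range(1, 11):
    for i, team in enumerate(['AAA', 'BBB', 'CCC']):
        home_w = 5 + (year + 3 * i) % 9
        # Made-up rule for the label, used only to fill the column
        rows.append({'year': year, 'tmID': team, 'homeW': home_w,
                     'playoff': int(home_w >= 9)})
toy = pd.DataFrame(rows)

features = ['homeW', 'year']

# Train on seasons 1 to 9, hold out season 10 as the test set
train_mask = toy['year'] <= 9
test_mask = toy['year'] == 10

X_train, y_train = toy.loc[train_mask, features], toy.loc[train_mask, 'playoff']
X_test, y_test = toy.loc[test_mask, features], toy.loc[test_mask, 'playoff']

print(X_train.shape, X_test.shape)
```

These four objects can be passed to the same `model.fit` and `model.predict` calls as in the cell above, so that no row from year 10 is seen during training.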