## **Machine Learning - WNBA Playoffs Prediction**
This notebook focuses on understanding the data. We use SQLite to store the data because of its scalability and because the data follows a relational schema.

https://docs.python.org/3/library/sqlite3.html

Import sqlite3 and connect to database file

### **Imports**

In [1]:
import pandas as pd
import sqlite3
import prep_utils as pu 
import raw_prep_utils as ru
import sys
import os
import seaborn as sns
import matplotlib.pyplot as plt

### **Database Connection Setup**

In [2]:
db = sqlite3.connect("db/ac.db")
db_cur = db.cursor()

[df_awards, df_coaches, df_players_teams, df_players, df_series_post, df_teams_post, df_teams] = pu.db_to_pandas(db)
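
`pu.db_to_pandas` is a project helper whose implementation is not shown in this notebook. As a rough sketch of how such a helper could work, the snippet below reads every table of a SQLite database into a pandas DataFrame. The function name `load_tables` and the toy `teams` table are illustrative assumptions, not the project's actual code.

```python
import sqlite3
import pandas as pd

def load_tables(conn):
    """Return {table_name: DataFrame} for every table in the database (hypothetical helper)."""
    names = pd.read_sql_query(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name", conn
    )["name"]
    return {name: pd.read_sql_query(f'SELECT * FROM "{name}"', conn) for name in names}

# Demo on an in-memory database with a tiny stand-in table (not the real ac.db)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE teams (year INTEGER, tmID TEXT, playoff TEXT)")
conn.executemany("INSERT INTO teams VALUES (?, ?, ?)", [(1, "HOU", "Y"), (1, "SEA", "N")])
conn.commit()

tables = load_tables(conn)
print(tables["teams"])
conn.close()
```

Returning a dictionary keyed by table name avoids depending on the order of a positional list, at the cost of slightly longer lookups.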

In [3]:
oi = ru.merge_all_raw_data(df_teams,df_players_teams,df_coaches,df_awards,df_teams_post,1)
print(oi.to_string())

Dropping Attribute lgID in Coaches...
Creating attribute coach previous regular season win ratio...
Creating attribute coach playoffs win ratio...
Creating attribute coach playoffs count...
Creating attribute coach awards count...

Coaches Null Verification:
year                     0
tmID                     0
coachID                  0
total_reg_season_win     0
total_reg_season_lost    0
total_playoffs_win       0
total_playoffs_lost      0
coach_playoffs_count     0
coach_awards             0
dtype: int64
Dropping Attribute lgID in Players_Teams...
Dropping divID in Teams...
Dropping ldID in Teams...
Dropping seeded in Teams...
Dropping tmORB, tmDRB, tmTRB, opptmORB, opptmDRB, opptmTRB in Teams...
Converting Target PLAYOFF to binary on Teams...
Creating attribute winrate Teams...
Creating attribute PlayOffs winrate Teams...
     year tmID franchID confID  rank  playoff firstRound semis finals            

***Prepare Coaches Dataframe***

In [4]:
df_new_coaches = pu.prepare_coaches(df_coaches, df_awards,1)
df_new_coaches = pu.group_coaches(df_new_coaches)
print(df_new_coaches.to_string())

Dropping Attribute lgID in Coaches...
Creating attribute coach previous regular season win ratio...
Creating attribute coach playoffs win ratio...
Creating attribute coach playoffs count...
Creating attribute coach awards count...
Dropping attribute post_wins..
Dropping attribute post_losses..
Dropping attribute won..
Dropping attribute lost..

Coaches Null Verification:
year                    0
tmID                    0
coachID                 0
coach_reg_season_wr     0
coach_po_season_wr      0
coach_playoffs_count    0
coach_awards            0
dtype: int64
     year tmID     coachID  coach_reg_season_wr  coach_po_season_wr  coach_playoffs_count  coach_awards
0       1  CHA  dunntr01wc             0.000000            0.000000                     0             0
1       1  CLE  hugheda99w             0.000000            0.000000                     0             0
2       1  DET  liebena01w             0.000000            0.000000                     0             0

***Prepare Players Dataframe***


In [5]:
df_new_players_teams = pu.prepare_player_teams(df_players_teams,df_awards,1)

Dropping Attribute lgID in Players_Teams...
        playerID  year tmID    GP    GS  minutes  points  oRebounds  dRebounds  rebounds  assists  steals  blocks  turnovers     PF  fgAttempted  fgMade  ftAttempted  ftMade  threeAttempted  threeMade   dq  PostGP  PostGS  PostMinutes  PostPoints  PostoRebounds  PostdRebounds  PostRebounds  PostAssists  PostSteals  PostBlocks  PostTurnovers  PostPF  PostfgAttempted  PostfgMade  PostftAttempted  PostftMade  PostthreeAttempted  PostthreeMade  PostDQ  player_awards
0     abrossv01w     2  MIN   0.0   0.0      0.0     0.0        0.0        0.0       0.0      0.0     0.0     0.0        0.0    0.0          0.0     0.0          0.0     0.0             0.0        0.0  0.0     0.0     0.0          0.0         0.0            0.0            0.0           0.0          0.0         0.0         0.0            0.0     0.0              0.0         0.0              0.0         0.0                 0.0            0.0     0.0              0
1     abrossv0

***Prepare Teams Dataframe***

In [6]:
df_new_teams = pu.prepare_teams(df_teams,df_teams_post,1)
print(df_new_teams.to_string())

Dropping divID in Teams...
Dropping ldID in Teams...
Dropping seeded in Teams...
Dropping tmORB, tmDRB, tmTRB, opptmORB, opptmDRB, opptmTRB in Teams...
Dropping GP, homeW, homeL, awayW, awayL, confW, confL, attend, name, confID, franchID & arena in Teams...
Converting Target PLAYOFF to binary on Teams...
Creating attribute winrate Teams...
Dropping won & lost in Teams...
Creating attribute PlayOffs winrate Teams...
     year tmID  rank  playoff   o_fgm   o_fga  o_ftm  o_fta  o_3pm  o_3pa  o_oreb  o_dreb   o_reb  o_asts   o_pf  o_stl   o_to  o_blk   o_pts   d_fgm   d_fga  d_ftm  d_fta  d_3pm  d_3pa  d_oreb  d_dreb   d_reb  d_asts   d_pf  d_stl   d_to  d_blk   d_pts   min  team_playoffs_count   Winrate  PO_Winrate
0       9  ATL   0.0        0     0.0     0.0    0.0    0.0    0.0    0.0     0.0     0.0     0.0     0.0    0.0    0.0    0.0    0.0     0.0     0.0     0.0    0.0    0.0    0.0    0.0     0.0     0.0     0

In [7]:
df_new_player_rankings = pu.prepare_players_for_ranking(df_players_teams, df_awards)
feature_importance, df_new_players = pu.feature_importance_players(df_new_player_rankings, df_players,df_teams)



Mean Squared Error for G: 0.2662291338582678
Feature importance for G:
fg%: 0.13789774098654722
PER: 0.09742856818861576
3pt%: 0.08969237231324588
PPM: 0.08884032640694293
ft%: 0.08783498981099708
PF: 0.08062104984339317
assists: 0.07905792936763138
turnovers: 0.06326622782504805
steals: 0.05544673685245969
oRebounds: 0.054047391428780234
rebounds: 0.051503118877998834
dRebounds: 0.049386817475605166
blocks: 0.04888798021651944
dq: 0.014743342130569572
player_awards: 0.0013454082756455587

Mean Squared Error for C-F: 0.30338888888888893
Feature importance for C-F:
blocks: 0.2665339784313408
PER: 0.1033268042768388
fg%: 0.10009581498944511
PPM: 0.0793412305676557
turnovers: 0.07659523381323619
assists: 0.07502547720599853
ft%: 0.06572347606323309
PF: 0.04916890170686998
dRebounds: 0.044470488858616486
steals: 0.0353591838550666
oRebounds: 0.03443375660482893
rebounds: 0.031068332617293016
3pt%: 0.03048899251731528
dq: 0.00836832849226145
player_awards: 0.0

Mean Squared Error for C: 0.

In [8]:
rank_players_regular = pu.ranking_players(feature_importance, df_new_players)
print('Best players in the regular season: ')
print(rank_players_regular)


Best players in the regular season: 
        playerID  year    rating
1074  jacksla01w     8  0.520489
204   jacksla01w     4  0.509309
1155  parkeca01w     9  0.502172
1580  catchta01w     3  0.498644
1655  leslili01w     5  0.494160
...          ...   ...       ...
1372  gaithka01w     3  0.044037
543   berezva01w     9  0.043253
860   weberma01w     8  0.043191
959   oneilkr01w     9  0.042937
1531  chambco01w     8  0.000000

[1805 rows x 3 columns]


In [9]:
rank_playoff_players = pu.ranking_playoff_players(feature_importance, df_new_players)
print('Best players in the playoffs: ')
print(rank_playoff_players)

Best players in the playoffs: 
        playerID  year  PostRating
395    zollsh01w     9    0.060429
1393  zirkozu01w     4    0.060429
657   zellosh01w    10    0.360786
835    zarafr01w     6    0.364128
1322  zakalok01w     1    0.051230
...          ...   ...         ...
33    abrossv01w     5    0.226121
45    abrossv01w     6    0.066756
59    abrossv01w     7    0.066756
81    abrossv01w     9    0.235882
0     abrossv01w     2    0.066756

[1805 rows x 3 columns]


In [10]:
power_ratings = pu.team_power_rating(df_teams, df_new_players)


sorted_power_ratings = power_ratings.sort_values(by=['year', 'PowerRating'], ascending=[True, False])
print(sorted_power_ratings)
# Group by year and select the top 8 teams by power rating
top_teams_by_year = sorted_power_ratings.groupby('year').head(8)

# Count how many of the top 8 teams for each year made the playoffs
playoffs_made_by_year = top_teams_by_year.groupby('year')['playoff'].apply(lambda x: (x == 'Y').sum()).reset_index()

# Print the per-year results
for index, row in playoffs_made_by_year.iterrows():
    print('Year ' + str(row['year']) + ' based on Power Ratings ' + str(row['playoff']) + '/8 best teams made the playoffs')

# The ratio is a fraction in [0, 1], so scale it by 100 before appending '%'
print('Ranking System Accuracy: ' + str(100 * playoffs_made_by_year['playoff'].sum() / (8 * len(playoffs_made_by_year))) + '%')

     year tmID  PowerRating playoff  rank
5       1  LAS     0.344478       Y     1
8       1  NYL     0.342137       Y     1
3       1  HOU     0.335949       Y     2
12      1  SAC     0.324381       Y     3
9       1  ORL     0.316429       Y     3
..    ...  ...          ...     ...   ...
135    10  MIN     0.277680       N     5
136    10  NYL     0.274123       N     7
131    10  CON     0.270214       N     6
130    10  CHI     0.263130       N     5
138    10  SAC     0.237687       N     6

[142 rows x 5 columns]
Year 1 based on Power Ratings 8/8 best teams made the playoffs
Year 2 based on Power Ratings 8/8 best teams made the playoffs
Year 3 based on Power Ratings 8/8 best teams made the playoffs
Year 4 based on Power Ratings 7/8 best teams made the playoffs
Year 5 based on Power Ratings 7/8 best teams made the playoffs
Year 6 based on Power Ratings 7/8 best teams made the playoffs
Year 7 based on Power Ratings 7/8 best teams made the playoffs
Year 8 based on Power Ratings 7
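
The check above (how many of the top-rated teams per year made the playoffs) can be wrapped in a small reusable function. The sketch below is one possible formulation on made-up data. The name `top_k_hit_rate` and the toy ratings are assumptions for illustration only, and the column names simply mirror those printed above.

```python
import pandas as pd

def top_k_hit_rate(ratings, k):
    """Per-year count of top-k rated teams that made the playoffs, plus the overall fraction."""
    ordered = ratings.sort_values(["year", "PowerRating"], ascending=[True, False])
    top_k = ordered.groupby("year").head(k)
    hits = (top_k["playoff"] == "Y").groupby(top_k["year"]).sum()
    # Assumes every year has at least k teams; otherwise the denominator is too large
    return hits, hits.sum() / (k * len(hits))

# Toy data (not the real power ratings)
toy = pd.DataFrame({
    "year": [1, 1, 1, 2, 2, 2],
    "tmID": ["A", "B", "C", "A", "B", "C"],
    "PowerRating": [0.34, 0.30, 0.25, 0.28, 0.33, 0.31],
    "playoff": ["Y", "Y", "N", "Y", "Y", "N"],
})

hits, accuracy = top_k_hit_rate(toy, k=2)
print(hits)
print(f"Ranking System Accuracy: {accuracy:.0%}")
```

Returning the accuracy as a fraction and formatting it with the `%` format specifier keeps the number and its unit consistent.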

In [11]:
best_colleges = pu.best_colleges(df_players_teams,df_teams,df_players)


print(best_colleges)

                    college  TotalPlayoffAppearances  CollegeRank
88                Tennessee                       21            1
17              Connecticut                       17            2
31                  Georgia                       15            3
86                 Stanford                       12            4
48           Louisiana Tech                       11            5
..                      ...                      ...          ...
80               Seton Hall                        1           14
25    Florida International                        1           14
82     Southern Mississippi                        1           14
50                    Maine                        1           14
0   Academy of Sport Moscow                        1           14

[113 rows x 3 columns]


In [12]:
awards = pu.player_awards(df_new_players,df_awards)


# get a player, order by year
player = awards[awards['playerID'] == 'leslili01w']
player = player.sort_values(by=['year'], ascending=[True])

print(player)


        playerID  year  award  cumulative_awards
2750  leslili01w     1      0                0.0
2751  leslili01w     2      3                0.0
2752  leslili01w     3      2                3.0
2753  leslili01w     4      0                5.0
2754  leslili01w     5      2                5.0
2755  leslili01w     6      0                7.0
2756  leslili01w     7      2                7.0
2757  leslili01w     8      0                9.0
2758  leslili01w     9      1                9.0
2759  leslili01w    10      0               10.0
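
In the output above, `cumulative_awards` for a given year appears to equal the awards won in all earlier years (for example, 5.0 in year 4 after 3 awards in year 2 and 2 in year 3). The internals of `pu.player_awards` are not shown, so the following is only a sketch of one way to build such a lagged running total, using made-up player IDs.

```python
import pandas as pd

# Toy data: the first player reuses the award counts of the first five rows above
awards = pd.DataFrame({
    "playerID": ["p1"] * 5 + ["p2"] * 2,
    "year": [1, 2, 3, 4, 5, 1, 2],
    "award": [0, 3, 2, 0, 2, 1, 0],
})

awards = awards.sort_values(["playerID", "year"])
# Running total per player, minus the current year, so only past seasons count
awards["cumulative_awards"] = (
    awards.groupby("playerID")["award"].cumsum() - awards["award"]
)
print(awards)
```

Excluding the current season matters when the feature is used to predict that same season, since awards are only known after it ends.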


In [13]:
teams = pu.team_ratings(sorted_power_ratings)


team = teams[teams['tmID'] == 'HOU']
print(team)

     year tmID  PowerRating playoff  rank  cum_Rating
3       1  HOU     0.335949       Y     2    0.000000
19      2  HOU     0.295997       Y     4    0.335949
35      3  HOU     0.284936       Y     2    0.315973
52      4  HOU     0.306900       Y     2    0.305627
65      5  HOU     0.254759       N     6    0.305945
78      6  HOU     0.272805       Y     3    0.295708
92      7  HOU     0.282086       Y     3    0.291891
105     8  HOU     0.271124       N     5    0.290490
119     9  HOU     0.260530       N     5    0.288069
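
The `cum_Rating` values above are consistent with the mean of the team's power ratings over all previous years, with 0 for the first year (for instance, year 3 shows 0.315973, the mean of 0.335949 and 0.295997). Assuming that is the intent, a minimal sketch with pandas could look like this. It is not the actual body of `pu.team_ratings`.

```python
import pandas as pd

# First four HOU rows from the table above
ratings = pd.DataFrame({
    "tmID": ["HOU"] * 4,
    "year": [1, 2, 3, 4],
    "PowerRating": [0.335949, 0.295997, 0.284936, 0.306900],
})

ratings = ratings.sort_values(["tmID", "year"])
# Expanding mean per team, shifted one row so each year only sees earlier years
ratings["cum_Rating"] = (
    ratings.groupby("tmID")["PowerRating"]
    .transform(lambda s: s.expanding().mean().shift(1))
    .fillna(0.0)
)
print(ratings)
```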


In [14]:
colleges = pu.teams_colleges(df_new_players,best_colleges,df_teams)

colleges = colleges[colleges['tmID'] == 'IND']

ordered_colleges = colleges.sort_values(by=['year', 'CollegeRank'], ascending=[True, True])

print(ordered_colleges)

   tmID  year  CollegeRank   min  rank
43  IND     1     5.168872  6425     7
44  IND     2     5.448031  6475     6
45  IND     3     4.569650  6425     4
46  IND     4     5.996073  6875     5
47  IND     5     6.759708  6850     6
48  IND     6     4.771264  6925     2
49  IND     7     7.064672  6850     3
50  IND     8     8.205236  6875     2
51  IND     9     6.474676  6950     4
52  IND    10     6.009819  6925     1


Final Table for Testing

In [15]:
merged_data = pu.merge_all_data(df_new_coaches,df_new_teams,df_new_players_teams)
merged_data.drop('coachID',axis = 1, inplace = True)
merged_data = merged_data[merged_data['year'] != 1]

print(merged_data.to_string())
display(merged_data)
print(merged_data.columns)

     year tmID  rank  playoff   o_fgm   o_fga  o_ftm  o_fta  o_3pm  o_3pa  o_oreb  o_dreb   o_reb  o_asts   o_pf  o_stl   o_to  o_blk   o_pts   d_fgm   d_fga  d_ftm  d_fta  d_3pm  d_3pa  d_oreb  d_dreb   d_reb  d_asts   d_pf  d_stl   d_to  d_blk   d_pts   min  team_playoffs_count   Winrate  PO_Winrate  coach_reg_season_wr  coach_po_season_wr  coach_playoffs_count  coach_awards  players_GP  players_GS  players_minutes  players_points  players_oRebounds  players_dRebounds  players_rebounds  players_assists  players_steals  players_blocks  players_turnovers  players_PF  players_fgAttempted  players_fgMade  players_ftAttempted  players_ftMade  players_threeAttempted  players_threeMade  players_dq  players_PostGP  players_PostGS  players_PostMinutes  players_PostPoints  players_PostoRebounds  players_PostdRebounds  players_PostRebounds  players_PostAssists  players_PostSteals  players_PostBlocks  players_PostTurnovers  players_PostPF  players_PostfgAttempted  players_PostfgMade  players_Pos

Unnamed: 0,year,tmID,rank,playoff,o_fgm,o_fga,o_ftm,o_fta,o_3pm,o_3pa,...,players_PostTurnovers,players_PostPF,players_PostfgAttempted,players_PostfgMade,players_PostftAttempted,players_PostftMade,players_PostthreeAttempted,players_PostthreeMade,players_PostDQ,players_player_awards
0,9,ATL,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,...,18.0,34.0,111.0,42.0,27.0,17.0,49.0,18.0,1.0,2
1,10,ATL,7.0,1,895.0,2258.0,542.0,725.0,202.0,598.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4
3,2,CHA,8.0,1,812.0,1903.0,431.0,577.0,131.0,386.0,...,10.0,18.0,27.0,10.0,10.0,6.0,13.0,3.0,0.0,0
4,3,CHA,4.0,1,746.0,1780.0,410.0,528.0,153.0,428.0,...,98.0,132.0,397.0,160.0,84.0,66.0,103.0,39.0,0.0,0
5,4,CHA,2.0,1,770.0,1790.0,490.0,663.0,211.0,527.0,...,22.0,23.0,106.0,39.0,16.0,8.0,40.0,10.0,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
137,6,WAS,4.0,0,873.0,2088.0,474.0,661.0,105.0,283.0,...,26.0,45.0,134.0,54.0,46.0,38.0,10.0,5.0,0.0,1
138,7,WAS,5.0,1,847.0,1968.0,388.0,546.0,181.0,510.0,...,4.0,9.0,34.0,9.0,11.0,8.0,18.0,3.0,0.0,2
139,8,WAS,4.0,0,1016.0,2199.0,528.0,715.0,187.0,522.0,...,29.0,47.0,140.0,48.0,38.0,22.0,32.0,8.0,1.0,2
140,9,WAS,5.0,0,877.0,2170.0,668.0,839.0,163.0,528.0,...,3.0,6.0,4.0,3.0,3.0,2.0,0.0,0.0,0.0,2


Index(['year', 'tmID', 'rank', 'playoff', 'o_fgm', 'o_fga', 'o_ftm', 'o_fta',
       'o_3pm', 'o_3pa', 'o_oreb', 'o_dreb', 'o_reb', 'o_asts', 'o_pf',
       'o_stl', 'o_to', 'o_blk', 'o_pts', 'd_fgm', 'd_fga', 'd_ftm', 'd_fta',
       'd_3pm', 'd_3pa', 'd_oreb', 'd_dreb', 'd_reb', 'd_asts', 'd_pf',
       'd_stl', 'd_to', 'd_blk', 'd_pts', 'min', 'team_playoffs_count',
       'Winrate', 'PO_Winrate', 'coach_reg_season_wr', 'coach_po_season_wr',
       'coach_playoffs_count', 'coach_awards', 'players_GP', 'players_GS',
       'players_minutes', 'players_points', 'players_oRebounds',
       'players_dRebounds', 'players_rebounds', 'players_assists',
       'players_steals', 'players_blocks', 'players_turnovers', 'players_PF',
       'players_fgAttempted', 'players_fgMade', 'players_ftAttempted',
       'players_ftMade', 'players_threeAttempted', 'players_threeMade',
       'players_dq', 'players_PostGP', 'players_PostGS', 'players_PostMinutes',
       'players_PostPoints', 'players_Posto

### **Data Preparation**
We will prepare the data in each table by cleaning and formatting it so that the machine learning models can use it easily afterwards.

In [16]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
merged_data['tmID'] = label_encoder.fit_transform(merged_data['tmID'])

x = merged_data.drop('playoff', axis=1)
y = merged_data['playoff']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)


# Create a Decision Tree classifier
clf = DecisionTreeClassifier()

# Train the classifier on the training data
clf.fit(x_train, y_train)

# Make predictions on the test data
predictions = clf.predict(x_test)

# Evaluate the accuracy of the classifier
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy}")

Accuracy: 0.5384615384615384
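
The accuracy above comes from a single unseeded 80/20 split, and the value is consistent with 14 correct predictions out of 26 test rows, so it can vary noticeably between runs. As a hedged sketch of a more stable estimate, the snippet below runs seeded, stratified 5-fold cross-validation with the same classifier type. It uses synthetic data from `make_classification` as a stand-in for `merged_data`, so the printed number says nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for merged_data: 130 rows, binary target (assumption)
X, y = make_classification(n_samples=130, n_features=10, n_informative=4, random_state=0)

# Fixed seeds make the folds and the tree reproducible across runs
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = DecisionTreeClassifier(random_state=0)

scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
print(f"Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the mean and spread over folds gives a clearer picture than one split when the table has only a little over a hundred rows.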
