## **Machine Learning - WNBA Playoffs Prediction**
This notebook focuses on understanding the data. We use SQLite to store the data because it is lightweight and the dataset follows a relational schema.

https://docs.python.org/3/library/sqlite3.html

Import `sqlite3` and connect to the database file.

### **Imports**

In [1]:
import pandas as pd
import sqlite3
import prep_utils as pu 
import raw_prep_utils as ru
import sys
import os
import seaborn as sns
import matplotlib.pyplot as plt

### **Database Connection Setup**

In [2]:
db = sqlite3.connect("db/ac.db")
db_cur = db.cursor()

[df_awards, df_coaches, df_players_teams, df_players, df_series_post, df_teams_post, df_teams] = pu.db_to_pandas(db)
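
The helper `pu.db_to_pandas` is project code that is not shown in this notebook. As a rough illustration of what loading every SQLite table into pandas can look like, here is a minimal sketch. The function name `load_tables` and the in-memory toy table are assumptions made for the example, not the project's implementation.

```python
import sqlite3

import pandas as pd


def load_tables(conn):
    """Return {table_name: DataFrame} for every table in the database (hypothetical helper)."""
    names = pd.read_sql_query(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name", conn
    )["name"]
    return {name: pd.read_sql_query(f'SELECT * FROM "{name}"', conn) for name in names}


# In-memory stand-in for db/ac.db, holding one made-up table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE teams (year INTEGER, tmID TEXT, playoff TEXT)")
conn.executemany(
    "INSERT INTO teams VALUES (?, ?, ?)", [(1, "HOU", "Y"), (1, "IND", "N")]
)
conn.commit()

tables = load_tables(conn)
print(tables["teams"])
```

Returning a dictionary keyed by table name avoids relying on the order of a positional list when unpacking the DataFrames.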

In [3]:
oi = ru.merge_all_raw_data(df_teams,df_players_teams,df_coaches,df_awards,df_teams_post,10)
print(oi.to_string())

Dropping Attribute lgID in Coaches...
Creating attribute coach previous regular season win ratio...
Creating attribute coach playoffs win ratio...
Creating attribute coach playoffs count...
Creating attribute coach awards count...

Coaches Null Verification:
year                     0
tmID                     0
coachID                  0
total_reg_season_win     0
total_reg_season_lost    0
total_playoffs_win       0
total_playoffs_lost      0
coach_playoffs_count     0
coach_awards             0
dtype: int64
Dropping Attribute lgID in Players_Teams...
Dropping divID in Teams...
Dropping ldID in Teams...
Dropping seeded in Teams...
Dropping tmORB, tmDRB, tmTRB, opptmORB, opptmDRB, opptmTRB in Teams...
Converting Target PLAYOFF to binary on Teams...
Creating attribute winrate Teams...
Creating attribute PlayOffs winrate Teams...
     year tmID franchID confID      rank  playoff firstRound semis finals        

***Prepare Coaches Dataframe***

In [4]:
df_new_coaches = pu.prepare_coaches(df_coaches, df_awards,10)
df_new_coaches = pu.group_coaches(df_new_coaches)
print(df_new_coaches.to_string())

Dropping Attribute lgID in Coaches...
Creating attribute coach previous regular season win ratio...
Creating attribute coach playoffs win ratio...
Creating attribute coach playoffs count...
Creating attribute coach awards count...
Dropping attribute post_wins..
Dropping attribute post_losses..
Dropping attribute won..
Dropping attribute lost..

Coaches Null Verification:
year                    0
tmID                    0
coachID                 0
coach_reg_season_wr     0
coach_po_season_wr      0
coach_playoffs_count    0
coach_awards            0
dtype: int64
     year tmID     coachID  coach_reg_season_wr  coach_po_season_wr  coach_playoffs_count  coach_awards
0       1  CHA  dunntr01wc             0.000000            0.000000                     0             0
1       1  CLE  hugheda99w             0.000000            0.000000                     0             0
2       1  DET  liebena01w             0.000000            0.000000                     0             0
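
The log mentions a "coach previous regular season win ratio", and every coach starts at 0 in year 1. One way to build such a feature without leaking the current season is to accumulate wins and losses over earlier seasons only. The sketch below assumes that interpretation; the toy data and the exact formula are illustrative and may differ from what `pu.prepare_coaches` does (for example, it may use a limited window of past years).

```python
import pandas as pd

# Made-up coaching records: two coaches, consecutive seasons
coaches = pd.DataFrame({
    "coachID": ["a", "a", "a", "b", "b"],
    "year": [1, 2, 3, 1, 2],
    "won": [20, 10, 15, 5, 25],
    "lost": [12, 22, 17, 27, 7],
})
coaches = coaches.sort_values(["coachID", "year"])

grouped = coaches.groupby("coachID")
# Cumulative totals up to, but excluding, the current season
prev_won = grouped["won"].cumsum() - coaches["won"]
prev_lost = grouped["lost"].cumsum() - coaches["lost"]
prev_games = prev_won + prev_lost

# A coach with no earlier games gets a ratio of 0 instead of a division by zero
coaches["coach_reg_season_wr"] = (
    prev_won / prev_games.where(prev_games > 0)
).fillna(0.0)
print(coaches)
```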

***Prepare Players Dataframe***


In [5]:
df_new_players_teams = pu.prepare_player_teams(df_players_teams,df_awards,10)

Dropping Attribute lgID in Players_Teams...
        playerID  year tmID         GP         GS      minutes      points   oRebounds   dRebounds    rebounds     assists     steals      blocks   turnovers          PF  fgAttempted      fgMade  ftAttempted      ftMade  threeAttempted  threeMade        dq     PostGP    PostGS  PostMinutes  PostPoints  PostoRebounds  PostdRebounds  PostRebounds  PostAssists  PostSteals  PostBlocks  PostTurnovers     PostPF  PostfgAttempted  PostfgMade  PostftAttempted  PostftMade  PostthreeAttempted  PostthreeMade    PostDQ  player_awards
0     abrossv01w     2  MIN   0.000000   0.000000     0.000000    0.000000    0.000000    0.000000    0.000000    0.000000   0.000000    0.000000    0.000000    0.000000     0.000000    0.000000     0.000000    0.000000        0.000000   0.000000  0.000000   0.000000  0.000000     0.000000    0.000000       0.000000       0.000000      0.000000     0.000000    0.000000    0.000000       0.000000   0.000000         0.

***Prepare Teams Dataframe***

In [6]:
df_new_teams = pu.prepare_teams(df_teams,df_teams_post,10)
print(df_new_teams.to_string())

Dropping divID in Teams...
Dropping ldID in Teams...
Dropping seeded in Teams...
Dropping tmORB, tmDRB, tmTRB, opptmORB, opptmDRB, opptmTRB in Teams...
Dropping GP, homeW, homeL, awayW, awayL, confW, confL, attend, name, confID, franchID & arena in Teams...
Converting Target PLAYOFF to binary on Teams...
Creating attribute winrate Teams...
Dropping won & lost in Teams...
Creating attribute PlayOffs winrate Teams...
     year tmID      rank  playoff       o_fgm        o_fga       o_ftm       o_fta       o_3pm       o_3pa      o_oreb      o_dreb        o_reb      o_asts        o_pf       o_stl        o_to       o_blk        o_pts        d_fgm        d_fga       d_ftm       d_fta       d_3pm       d_3pa      d_oreb      d_dreb        d_reb      d_asts        d_pf       d_stl        d_to       d_blk        d_pts   min  team_playoffs_count   Winrate  PO_Winrate
0       9  ATL  0.000000        0    0.000000     0.000000  
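
Two of the logged steps, converting the `playoff` target to binary and deriving a win rate before dropping `won` and `lost`, can be sketched in a few lines of pandas. The toy rows are made up, and the sketch covers only the per-season computation; any aggregation over previous seasons done inside `pu.prepare_teams` is not reproduced here.

```python
import pandas as pd

# Made-up season records for two teams
teams = pd.DataFrame({
    "year": [1, 1],
    "tmID": ["HOU", "IND"],
    "playoff": ["Y", "N"],
    "won": [27, 9],
    "lost": [5, 23],
})

# 'Y'/'N' becomes 1/0 so the column can serve as a classification target
teams["playoff"] = (teams["playoff"] == "Y").astype(int)

# Share of games won in the season, then drop the raw counts
teams["Winrate"] = teams["won"] / (teams["won"] + teams["lost"])
teams = teams.drop(columns=["won", "lost"])
print(teams)
```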

In [7]:
df_new_player_rankings = pu.prepare_players_for_ranking(df_players_teams, df_awards)
feature_importance, df_new_players = pu.feature_importance_players(df_new_player_rankings, df_players,df_teams)



Mean Squared Error for G: 0.2635078740157481
Feature importance for G:
fg%: 0.14289150262203243
PER: 0.10323143773949953
3pt%: 0.0881539717006011
PPM: 0.08535113464386622
PF: 0.08167360436008597
ft%: 0.07882152218707668
assists: 0.07287893163893389
turnovers: 0.0660071397887569
steals: 0.05525433679372752
dRebounds: 0.0546302752516744
blocks: 0.05443824861382081
rebounds: 0.05263846698505331
oRebounds: 0.05040646290843075
dq: 0.012586400338984005
player_awards: 0.0010365644274564607

Mean Squared Error for C-F: 0.2912
Feature importance for C-F:
blocks: 0.24870923763197172
PER: 0.10598896745244675
fg%: 0.09190354071976109
PPM: 0.08587305330313381
ft%: 0.07712657771741023
turnovers: 0.06867761788170747
steals: 0.06763974566279125
assists: 0.06672220779330319
dRebounds: 0.05267492977865293
oRebounds: 0.046104817128488434
PF: 0.03106615618357105
3pt%: 0.030023627829855716
rebounds: 0.01937525347451239
dq: 0.00811426744239393
player_awards: 0.0

Mean Squared Error for C: 0.241832
Feature 
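
The output above reports, for each position, a mean squared error and one importance value per feature. A common way to obtain both is to fit a tree-ensemble regressor and read its importances. The sketch below assumes scikit-learn's `RandomForestRegressor` and synthetic data; the model actually used by `pu.feature_importance_players` is not shown in this notebook and may differ.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
features = ["fg%", "assists", "blocks"]

# Synthetic player stats; the target depends mostly on blocks by construction
X = pd.DataFrame(rng.random((300, 3)), columns=features)
y = 3.0 * X["blocks"] + 0.5 * X["fg%"] + rng.normal(0, 0.05, 300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

mse = mean_squared_error(y_test, model.predict(X_test))
importance = dict(zip(features, model.feature_importances_))

print(f"Mean Squared Error: {mse}")
for name, value in sorted(importance.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value}")
```

Fitting one such model per position group would give one importance dictionary per position, which matches the shape of the output above.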

In [8]:
rank_players_regular = pu.ranking_players(feature_importance, df_new_players)
print('Best players in the regular season: ')
print(rank_players_regular)


Best players in the regular season: 
        playerID  year    rating
1074  jacksla01w     8  0.521932
687   griffyo01w     1  0.518385
204   jacksla01w     4  0.510355
1155  parkeca01w     9  0.500505
1655  leslili01w     5  0.498486
...          ...   ...       ...
1372  gaithka01w     3  0.042772
543   berezva01w     9  0.042010
860   weberma01w     8  0.041959
959   oneilkr01w     9  0.039358
1531  chambco01w     8  0.000000

[1805 rows x 3 columns]
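
A rating like the one above can be built as a weighted sum of normalized statistics, with the feature importances as weights. The sketch below assumes min-max scaling and made-up weights; it illustrates the idea and is not the formula used by `pu.ranking_players`.

```python
import pandas as pd

# Made-up importances and player stats
importance = {"fg%": 0.5, "assists": 0.3, "blocks": 0.2}
players = pd.DataFrame({
    "playerID": ["p1", "p2", "p3"],
    "fg%": [0.40, 0.50, 0.45],
    "assists": [100, 50, 0],
    "blocks": [0, 10, 20],
})

stats = players[list(importance)]
# Min-max scale each stat to [0, 1]; constant columns keep a span of 1
span = (stats.max() - stats.min()).replace(0, 1)
scaled = (stats - stats.min()) / span

# Weight each scaled stat by its importance and sum across columns
players["rating"] = scaled.mul(pd.Series(importance), axis=1).sum(axis=1)
ranked = players.sort_values("rating", ascending=False)
print(ranked[["playerID", "rating"]])
```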


In [9]:
rank_playoff_players = pu.ranking_playoff_players(feature_importance, df_new_players)
print('Best players in the playoffs: ')
print(rank_playoff_players)

Best players in the playoffs: 
        playerID  year  PostRating
395    zollsh01w     9    0.064028
1393  zirkozu01w     4    0.064028
657   zellosh01w    10    0.358125
835    zarafr01w     6    0.360229
1322  zakalok01w     1    0.049758
...          ...   ...         ...
33    abrossv01w     5    0.229230
45    abrossv01w     6    0.071347
59    abrossv01w     7    0.071347
81    abrossv01w     9    0.237899
0     abrossv01w     2    0.071347

[1805 rows x 3 columns]


In [10]:
power_ratings = pu.team_power_rating(df_teams, df_new_players)


sorted_power_ratings = power_ratings.sort_values(by=['year', 'PowerRating'], ascending=[True, False])
print(sorted_power_ratings)
# Group by year and select the top 8 teams
top_teams_by_year = sorted_power_ratings.groupby('year').head(8)

# Count how many of the top 8 teams for each year made the playoffs
playoffs_made_by_year = top_teams_by_year.groupby('year')['playoff'].apply(lambda x: (x == 'Y').sum()).reset_index()

# Print the per-year results
for index, row in playoffs_made_by_year.iterrows():
    print('Year ' + str(row['year']) + ' based on Power Ratings ' + str(row['playoff']) + '/8 best teams made the playoffs')

# Share of top-8 teams that made the playoffs, expressed as a percentage
print('Ranking System Accuracy: ' + str(100 * playoffs_made_by_year['playoff'].sum() / (8 * len(playoffs_made_by_year))) + '%')

     year tmID  PowerRating playoff  rank
5       1  LAS     0.346404       Y     1
8       1  NYL     0.346200       Y     1
3       1  HOU     0.340451       Y     2
12      1  SAC     0.328267       Y     3
9       1  ORL     0.317836       Y     3
..    ...  ...          ...     ...   ...
135    10  MIN     0.276832       N     5
136    10  NYL     0.275208       N     7
131    10  CON     0.269684       N     6
130    10  CHI     0.263626       N     5
138    10  SAC     0.237474       N     6

[142 rows x 5 columns]
Year 1 based on Power Ratings 8/8 best teams made the playoffs
Year 2 based on Power Ratings 8/8 best teams made the playoffs
Year 3 based on Power Ratings 8/8 best teams made the playoffs
Year 4 based on Power Ratings 7/8 best teams made the playoffs
Year 5 based on Power Ratings 7/8 best teams made the playoffs
Year 6 based on Power Ratings 7/8 best teams made the playoffs
Year 7 based on Power Ratings 7/8 best teams made the playoffs
Year 8 based on Power Ratings 7

In [11]:
best_colleges = pu.best_colleges(df_players_teams,df_teams,df_players)


print(best_colleges)

                    college  TotalPlayoffAppearances  CollegeRank
88                Tennessee                       21            1
17              Connecticut                       17            2
31                  Georgia                       15            3
86                 Stanford                       12            4
48           Louisiana Tech                       11            5
..                      ...                      ...          ...
80               Seton Hall                        1           14
25    Florida International                        1           14
82     Southern Mississippi                        1           14
50                    Maine                        1           14
0   Academy of Sport Moscow                        1           14

[113 rows x 3 columns]


In [12]:
awards = pu.player_awards(df_new_players,df_awards)


# get a player, order by year
player = awards[awards['playerID'] == 'leslili01w']
player = player.sort_values(by=['year'], ascending=[True])

print(player)


        playerID  year  award  cumulative_awards
2750  leslili01w     1      0                0.0
2751  leslili01w     2      3                0.0
2752  leslili01w     3      2                3.0
2753  leslili01w     4      0                5.0
2754  leslili01w     5      2                5.0
2755  leslili01w     6      0                7.0
2756  leslili01w     7      2                7.0
2757  leslili01w     8      0                9.0
2758  leslili01w     9      1                9.0
2759  leslili01w    10      0               10.0
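
In the table above, `cumulative_awards` for a given year equals the sum of `award` over all earlier years, so the current season's awards are never counted. That pattern is a cumulative sum shifted by one row. The sketch below reproduces the column from the `award` values shown; whether `pu.player_awards` computes it this way is an assumption.

```python
import pandas as pd

# 'award' values copied from the output above
awards = pd.DataFrame({
    "playerID": ["leslili01w"] * 10,
    "year": list(range(1, 11)),
    "award": [0, 3, 2, 0, 2, 0, 2, 0, 1, 0],
})
awards = awards.sort_values(["playerID", "year"])

# Running total per player, shifted so each row only sees previous seasons
awards["cumulative_awards"] = (
    awards.groupby("playerID")["award"]
    .transform(lambda s: s.cumsum().shift(1, fill_value=0))
    .astype(float)
)
print(awards)
```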


In [13]:
teams = pu.team_ratings(sorted_power_ratings)


team = teams[teams['tmID'] == 'HOU']
print(team)

     year tmID  PowerRating playoff  rank  cum_Rating
3       1  HOU     0.340451       Y     2    0.000000
19      2  HOU     0.296849       Y     4    0.340451
35      3  HOU     0.292090       Y     2    0.318650
52      4  HOU     0.313439       Y     2    0.309797
65      5  HOU     0.259872       N     6    0.310707
78      6  HOU     0.280001       Y     3    0.300540
92      7  HOU     0.287066       Y     3    0.297117
105     8  HOU     0.273834       N     5    0.295681
119     9  HOU     0.260437       N     5    0.292950
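
The `cum_Rating` values above match the mean of `PowerRating` over all previous years, with 0 for a team's first season. For example, year 3 gives (0.340451 + 0.296849) / 2 = 0.318650. A minimal sketch of that computation with an expanding mean follows; it assumes this reading of the output and is not taken from `pu.team_ratings`.

```python
import pandas as pd

# First five HOU rows copied from the output above
ratings = pd.DataFrame({
    "year": [1, 2, 3, 4, 5],
    "tmID": ["HOU"] * 5,
    "PowerRating": [0.340451, 0.296849, 0.292090, 0.313439, 0.259872],
})
ratings = ratings.sort_values(["tmID", "year"])

# Mean of all earlier seasons per team; the first season has no history, so use 0
ratings["cum_Rating"] = (
    ratings.groupby("tmID")["PowerRating"]
    .transform(lambda s: s.expanding().mean().shift(1))
    .fillna(0.0)
)
print(ratings)
```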


In [14]:
colleges = pu.teams_colleges(df_new_players,best_colleges,df_teams)

colleges = colleges[colleges['tmID'] == 'IND']

ordered_colleges = colleges.sort_values(by=['year', 'CollegeRank'], ascending=[True, True])

print(ordered_colleges)

   tmID  year  CollegeRank   min  rank
43  IND     1     5.168872  6425     7
44  IND     2     5.448031  6475     6
45  IND     3     4.569650  6425     4
46  IND     4     5.996073  6875     5
47  IND     5     6.759708  6850     6
48  IND     6     4.771264  6925     2
49  IND     7     7.064672  6850     3
50  IND     8     8.205236  6875     2
51  IND     9     6.474676  6950     4
52  IND    10     6.009819  6925     1


Final Table for Testing

In [15]:
merged_data = pu.merge_all_data(df_new_coaches,df_new_teams,df_new_players_teams)
merged_data.drop('coachID',axis = 1, inplace = True)
merged_data = merged_data[merged_data['year'] != 1]

display(merged_data)
print(merged_data.columns)

Unnamed: 0,year,tmID,rank,playoff,o_fgm,o_fga,o_ftm,o_fta,o_3pm,o_3pa,...,players_PostTurnovers,players_PostPF,players_PostfgAttempted,players_PostfgMade,players_PostftAttempted,players_PostftMade,players_PostthreeAttempted,players_PostthreeMade,players_PostDQ,players_player_awards
0,9,ATL,0.000000,0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,21.058333,35.016667,104.950000,40.616667,36.300000,24.400000,38.308333,12.441667,0.125000,2
1,10,ATL,7.000000,1,895.000000,2258.000000,542.000000,725.000000,202.000000,598.000000,...,25.416667,31.767857,93.928571,36.410714,34.660714,25.779762,32.809524,8.708333,0.166667,4
3,2,CHA,8.000000,1,812.000000,1903.000000,431.000000,577.000000,131.000000,386.000000,...,10.000000,18.000000,27.000000,10.000000,10.000000,6.000000,13.000000,3.000000,0.000000,0
4,3,CHA,6.000000,1,779.000000,1841.500000,420.500000,552.500000,142.000000,407.000000,...,58.500000,88.500000,242.000000,98.000000,55.000000,41.000000,64.500000,22.500000,0.000000,0
5,4,CHA,4.666667,1,776.000000,1824.333333,443.666667,589.333333,165.000000,447.000000,...,46.000000,66.166667,195.833333,79.833333,47.000000,35.166667,51.000000,16.833333,0.000000,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
137,6,WAS,5.200000,0,829.200000,1983.400000,390.000000,545.600000,130.400000,396.800000,...,35.750000,60.350000,171.583333,75.533333,59.400000,49.300000,22.950000,8.833333,0.400000,1
138,7,WAS,5.166667,1,832.166667,1980.833333,389.666667,545.666667,138.833333,415.666667,...,48.883333,69.166667,194.300000,83.683333,61.983333,48.516667,55.950000,18.700000,0.750000,2
139,8,WAS,5.000000,0,858.428571,2012.000000,409.428571,569.857143,145.714286,430.857143,...,42.311905,58.526190,177.073810,71.388095,55.326190,42.690476,53.445238,16.947619,0.771429,2
140,9,WAS,5.000000,0,860.750000,2031.750000,441.750000,603.500000,147.875000,443.000000,...,15.948810,25.376190,82.858333,35.883333,26.394048,19.344048,7.182143,2.125000,0.000000,2


Index(['year', 'tmID', 'rank', 'playoff', 'o_fgm', 'o_fga', 'o_ftm', 'o_fta',
       'o_3pm', 'o_3pa', 'o_oreb', 'o_dreb', 'o_reb', 'o_asts', 'o_pf',
       'o_stl', 'o_to', 'o_blk', 'o_pts', 'd_fgm', 'd_fga', 'd_ftm', 'd_fta',
       'd_3pm', 'd_3pa', 'd_oreb', 'd_dreb', 'd_reb', 'd_asts', 'd_pf',
       'd_stl', 'd_to', 'd_blk', 'd_pts', 'min', 'team_playoffs_count',
       'Winrate', 'PO_Winrate', 'coach_reg_season_wr', 'coach_po_season_wr',
       'coach_playoffs_count', 'coach_awards', 'players_GP', 'players_GS',
       'players_minutes', 'players_points', 'players_oRebounds',
       'players_dRebounds', 'players_rebounds', 'players_assists',
       'players_steals', 'players_blocks', 'players_turnovers', 'players_PF',
       'players_fgAttempted', 'players_fgMade', 'players_ftAttempted',
       'players_ftMade', 'players_threeAttempted', 'players_threeMade',
       'players_dq', 'players_PostGP', 'players_PostGS', 'players_PostMinutes',
       'players_PostPoints', 'players_Posto

### **Data Preparation**
We prepare the data in each table by cleaning and formatting it so that the machine learning models can use it easily afterwards.

In [16]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
merged_data['tmID'] = label_encoder.fit_transform(merged_data['tmID'])

x = merged_data.drop('playoff', axis=1)
y = merged_data['playoff']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)


# Create a Decision Tree classifier
clf = DecisionTreeClassifier()

# Train the classifier on the training data
clf.fit(x_train, y_train)

# Make predictions on the test data
predictions = clf.predict(x_test)

# Evaluate the accuracy of the classifier
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy}")

Accuracy: 0.6923076923076923
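
Because `train_test_split` is called without a fixed `random_state` and the test set is small, the reported accuracy can change noticeably from run to run. One option for a more stable estimate is stratified k-fold cross-validation with fixed seeds. The sketch below uses synthetic data in place of `merged_data`; the fold count and seeds are arbitrary choices.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix and the binary playoff target
X = rng.random((130, 5))
y = (X[:, 0] + 0.1 * rng.normal(size=130) > 0.5).astype(int)

# Five stratified folds keep the class balance similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = DecisionTreeClassifier(random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")

print(f"Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the mean and the spread across folds makes it easier to compare models than a single split does.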
