## **Machine Learning - WNBA Playoffs Prediction**
This notebook focuses on understanding the data. We store the data in SQLite because the dataset follows a relational schema and SQLite keeps it in a single local file that Python can query directly.

https://docs.python.org/3/library/sqlite3.html

Import `sqlite3` and connect to the database file.

### **Imports**

In [1]:
import pandas as pd
import sqlite3
import prep_utils as pu 
import sys
import os
import seaborn as sns
import matplotlib.pyplot as plt

### **Database Connection Setup**

In [2]:
db = sqlite3.connect("db/ac.db")
db_cur = db.cursor()

[df_awards, df_coaches, df_players_teams, df_players, df_series_post, df_teams_post, df_teams] = pu.db_to_pandas(db)
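
The body of `pu.db_to_pandas` is not shown in this notebook. As a rough sketch of what such a helper might look like, assuming it simply reads each table with `pandas.read_sql_query`: the function name `load_tables`, the in-memory database and the toy rows below are all hypothetical and only serve to make the example runnable without `db/ac.db`.

```python
import sqlite3
import pandas as pd

def load_tables(conn, table_names):
    """Return one DataFrame per table name, in the order given (hypothetical helper)."""
    frames = []
    for name in table_names:
        # Table names cannot be bound as SQL parameters, so the identifier is quoted instead.
        frames.append(pd.read_sql_query(f'SELECT * FROM "{name}"', conn))
    return frames

# Toy in-memory database with made-up rows, standing in for db/ac.db.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE teams (year INTEGER, tmID TEXT, won INTEGER, lost INTEGER)")
demo.executemany(
    "INSERT INTO teams VALUES (?, ?, ?, ?)",
    [(1, "AAA", 24, 8), (1, "BBB", 8, 24)],
)
demo.commit()

[df_demo_teams] = load_tables(demo, ["teams"])
print(df_demo_teams)
```

Returning the frames as a list in a fixed order is what allows the one-line unpacking used in the cell above.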

***Prepare Coaches Dataframe***

In [3]:
#df_new_coaches = pu.prepare_coaches(df_coaches, df_awards,10)
#pu.group_coaches(df_new_coaches)

***Prepare Players Dataframe***


In [4]:
#df_new_players_teams = pu.prepare_player_teams(df_players_teams,df_awards,10)

***Prepare Teams Dataframe***

In [5]:
new_teams = pu.prepare_teams(df_teams,df_teams_post,3)
print(new_teams.to_string())

Dropping divID in Teams...
Dropping ldID in Teams...
Dropping seeded in Teams...
Dropping tmORB, tmDRB, tmTRB, opptmORB, opptmDRB, opptmTRB in Teams...
Dropping GP, homeW, homeL, awayW, awayL, confW, confL, attend, name, confID, franchID & arena in Teams...
Converting Target PLAYOFF to binary on Teams...
Creating attribute winrate Teams...
Dropping won & lost in Teams...
Creating attribute PlayOffs winrate Teams...
     year tmID      rank  playoff        o_fgm        o_fga       o_ftm       o_fta       o_3pm       o_3pa      o_oreb      o_dreb        o_reb      o_asts        o_pf       o_stl        o_to       o_blk        o_pts        d_fgm        d_fga       d_ftm       d_fta       d_3pm       d_3pa      d_oreb      d_dreb        d_reb      d_asts        d_pf       d_stl        d_to       d_blk        d_pts   min   Winrate  PO_Winrate
0       9  ATL  0.000000        0     0.000000     0.000000    0.000000    0.000
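
The log above mentions three steps that are easy to illustrate in isolation: converting the `playoff` target to binary, creating a win rate, and dropping `won` and `lost`. The sketch below covers only those steps on made-up rows. It assumes `playoff` holds `'Y'`/`'N'` and that the win rate is wins divided by games played; `add_winrate` is a hypothetical name, not the actual `pu.prepare_teams` code, which also handles the past-years argument.

```python
import pandas as pd

def add_winrate(teams):
    """Encode playoff as 0/1, add a Winrate column, then drop won/lost (illustrative only)."""
    out = teams.copy()
    out["playoff"] = (out["playoff"] == "Y").astype(int)
    games = out["won"] + out["lost"]
    # A team with no recorded games would divide by zero; give it a win rate of 0 instead.
    out["Winrate"] = (out["won"] / games.where(games > 0)).fillna(0.0)
    return out.drop(columns=["won", "lost"])

# Made-up rows, not taken from the database.
toy_teams = pd.DataFrame({
    "tmID": ["AAA", "BBB"],
    "playoff": ["Y", "N"],
    "won": [24, 8],
    "lost": [8, 24],
})
toy_prepared = add_winrate(toy_teams)
print(toy_prepared)
```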

In [6]:
df_new_player_rankings = pu.prepare_players_for_ranking(df_players_teams, df_awards)
feature_importance, df_new_players = pu.feature_importance_players(df_new_player_rankings, df_players,df_teams)



Mean Squared Error for G: 0.26681023622047245
Feature importance for G:
fg%: 0.1418847100108889
PER: 0.10256195616217675
3pt%: 0.09377948359370128
PPM: 0.0912850739767747
assists: 0.08012273715839409
PF: 0.07995759097733897
ft%: 0.07518453488610678
turnovers: 0.0619500851699979
dRebounds: 0.056824913504619705
steals: 0.05402685173272586
blocks: 0.05077995151112543
oRebounds: 0.049728005759491205
rebounds: 0.04796239019822781
dq: 0.012086324547026362
player_awards: 0.0018653908114044

Mean Squared Error for C-F: 0.3086111111111111
Feature importance for C-F:
blocks: 0.2033658354676033
PER: 0.12902380082124337
assists: 0.11853249466800848
ft%: 0.08438837397299706
turnovers: 0.08144955899710504
PPM: 0.07813641769215261
fg%: 0.06954122932765137
dRebounds: 0.058572164888324244
steals: 0.057360469641179225
3pt%: 0.03285038348624393
oRebounds: 0.0299625864755354
PF: 0.029520099271169654
rebounds: 0.017956900436128277
dq: 0.009339684854658116
player_awards: 0.0

Mean Squared Error for C: 0.24

In [7]:
rank_players_regular = pu.ranking_players(feature_importance, df_new_players)
print('Best players in the regular season: ')
print(rank_players_regular)


Best players in the regular season: 
        playerID  year    rating
1074  jacksla01w     8  0.514343
204   jacksla01w     4  0.503667
1155  parkeca01w     9  0.501890
1580  catchta01w     3  0.500183
1655  leslili01w     5  0.494630
...          ...   ...       ...
543   berezva01w     9  0.046593
860   weberma01w     8  0.046476
959   oneilkr01w     9  0.046196
934   robincr01w     8  0.035229
1531  chambco01w     8  0.000000

[1805 rows x 3 columns]
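
One plausible way to turn feature importances into a single `rating` is a weighted sum of min-max scaled statistics, with the importances as weights. This is an assumption about how `pu.ranking_players` might work, not its confirmed implementation; `rate_players` and the toy values are hypothetical.

```python
import pandas as pd

def rate_players(stats, importance):
    """Weighted sum of min-max scaled stat columns, using the importances as weights."""
    cols = list(importance)
    span = stats[cols].max() - stats[cols].min()
    # A column with no spread would divide by zero; it contributes 0 instead.
    scaled = ((stats[cols] - stats[cols].min()) / span.where(span > 0)).fillna(0.0)
    weights = pd.Series(importance)
    rated = stats[["playerID", "year"]].copy()
    rated["rating"] = scaled.mul(weights, axis=1).sum(axis=1)
    return rated.sort_values("rating", ascending=False)

# Made-up players and weights, for illustration only.
toy_stats = pd.DataFrame({
    "playerID": ["p1", "p2", "p3"],
    "year": [1, 1, 1],
    "assists": [10, 30, 20],
    "blocks": [4, 0, 2],
})
toy_importance = {"assists": 0.75, "blocks": 0.25}
toy_ranked = rate_players(toy_stats, toy_importance)
print(toy_ranked)
```

Because every statistic is scaled to [0, 1] and the weights sum to 1, a rating built this way also stays in [0, 1], which is consistent with the range of values printed above.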


In [8]:
rank_playoff_players = pu.ranking_playoff_players(feature_importance, df_new_players)
print('Best players in the playoffs: ')
print(rank_playoff_players)

        playerID  year  stint tmID  lgID  GP  GS  minutes  points  oRebounds  dRebounds  rebounds   assists    steals    blocks  turnovers        PF  fgAttempted  fgMade  ftAttempted  ftMade  threeAttempted  threeMade        dq  PostGP  PostGS  PostMinutes  PostPoints  PostoRebounds  PostdRebounds  PostRebounds  PostAssists  PostSteals  PostBlocks  PostTurnovers    PostPF  PostfgAttempted  PostfgMade  PostftAttempted  PostftMade  PostthreeAttempted  PostthreeMade  PostDQ  player_awards       bioID  pos  firstseason  lastseason  height  weight                        college                 collegeOther   birthDate   deathDate  playoff       PPM       fg%       ft%      3pt%   PostPPM   Postfg%   Postft%  Post3pt%       PER    rating   PostPER
0     abrossv01w     2      0  MIN  WNBA  26  23      846     343   0.265432   0.474638  0.479339  0.224576  0.424242  0.079646   0.674603  0.489510          293     114          132      96              76         19  0.285714       0       0     

In [9]:
power_ratings = pu.team_power_rating(df_teams, df_new_players)



sorted_power_ratings = power_ratings.sort_values(by=['year', 'PowerRating'], ascending=[True, False])
print(sorted_power_ratings)
# Group by year and select the top 8 teams
top_teams_by_year = sorted_power_ratings.groupby('year').head(8)

# Count how many of the top 8 teams for each year made the playoffs
playoffs_made_by_year = top_teams_by_year.groupby('year')['playoff'].apply(lambda x: (x == 'Y').sum()).reset_index()

# Print or use the results
for index, row in playoffs_made_by_year.iterrows():
    print('Year ' + str(row['year']) + ' based on Power Ratings ' + str(row['playoff']) + '/8 best teams made the playoffs')

# Multiply by 100 so the printed value matches the '%' sign
print('Ranking System Accuracy: ' + str(100 * playoffs_made_by_year['playoff'].sum() / (8 * len(playoffs_made_by_year))) + '%')

     year tmID  PowerRating playoff  rank
5       1  LAS     0.344877       Y     1
8       1  NYL     0.341523       Y     1
3       1  HOU     0.336007       Y     2
12      1  SAC     0.321868       Y     3
9       1  ORL     0.315183       Y     3
..    ...  ...          ...     ...   ...
135    10  MIN     0.276768       N     5
136    10  NYL     0.273556       N     7
131    10  CON     0.268308       N     6
130    10  CHI     0.262895       N     5
138    10  SAC     0.236024       N     6

[142 rows x 5 columns]
Year 1 based on Power Ratings 8/8 best teams made the playoffs
Year 2 based on Power Ratings 8/8 best teams made the playoffs
Year 3 based on Power Ratings 8/8 best teams made the playoffs
Year 4 based on Power Ratings 7/8 best teams made the playoffs
Year 5 based on Power Ratings 7/8 best teams made the playoffs
Year 6 based on Power Ratings 7/8 best teams made the playoffs
Year 7 based on Power Ratings 7/8 best teams made the playoffs
Year 8 based on Power Ratings 8
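
The `PowerRating` column comes from `pu.team_power_rating`, whose body is not shown. A minimal sketch of one way a team rating could be derived from player ratings, assuming a minutes-weighted mean per team and year: the function name `power_rating` and the toy rows are hypothetical, and the real helper may aggregate differently.

```python
import pandas as pd

def power_rating(players):
    """Minutes-weighted mean of player ratings per (year, tmID); assumes minutes > 0 per team."""
    weighted = players.assign(weighted=players["rating"] * players["minutes"])
    grouped = weighted.groupby(["year", "tmID"], as_index=False)[["weighted", "minutes"]].sum()
    grouped["PowerRating"] = grouped["weighted"] / grouped["minutes"]
    return grouped[["year", "tmID", "PowerRating"]]

# Made-up player rows, for illustration only.
toy_players = pd.DataFrame({
    "year": [1, 1, 1],
    "tmID": ["AAA", "AAA", "BBB"],
    "rating": [0.5, 0.25, 0.25],
    "minutes": [300, 100, 200],
})
toy_power = power_rating(toy_players)
print(toy_power)
```

Weighting by minutes keeps players who rarely play from pulling a team's rating up or down as much as its starters.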

In [10]:
# prepare_teams also requires teams_post and past_years (same arguments as In [5])
pu.prepare_teams(df_teams, df_teams_post, 3)

In [None]:
pu.best_colleges(df_players_teams,df_teams,df_players)

                  college  TotalPlayoffAppearances  CollegeRank
88              Tennessee                       21            1
17            Connecticut                       17            2
31                Georgia                       15            3
86               Stanford                       12            4
48         Louisiana Tech                       11            5
..                    ...                      ...          ...
80             Seton Hall                        1           14
25  Florida International                        1           14
82   Southern Mississippi                        1           14
50                  Maine                        1           14
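
The tied ranks in the output (several colleges share rank 14) suggest a dense ranking over total playoff appearances. A minimal sketch of that idea, assuming one row per player-season with a 0/1 `playoff` flag: `rank_colleges` and the toy data are hypothetical and do not reproduce `pu.best_colleges`.

```python
import pandas as pd

def rank_colleges(player_seasons):
    """Sum playoff appearances per college and dense-rank the totals (1 = most appearances)."""
    totals = (
        player_seasons.groupby("college", as_index=False)["playoff"].sum()
        .rename(columns={"playoff": "TotalPlayoffAppearances"})
    )
    totals["CollegeRank"] = (
        totals["TotalPlayoffAppearances"].rank(method="dense", ascending=False).astype(int)
    )
    # Sort by rank, then by name, so that ties come out in a deterministic order.
    return totals.sort_values(["CollegeRank", "college"]).reset_index(drop=True)

# Made-up player-seasons, for illustration only.
toy_seasons = pd.DataFrame({
    "college": ["A", "A", "A", "B", "B", "C"],
    "playoff": [1, 1, 1, 1, 0, 1],
})
toy_colleges = rank_colleges(toy_seasons)
print(toy_colleges)
```

With `method="dense"`, tied colleges share a rank and the next rank is not skipped.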


### **Data Preparation**
We will prepare the data in each table by cleaning and formatting it so that the machine learning models can use it directly afterwards.